DISCRETE DISTRIBUTION


RANDOM VARIABLE OF DISCRETE TYPE 


A sample space S may be difficult to describe if the elements of S are not numbers. Let us discuss how one can
use a rule by which each simple outcome of a random experiment, an element s of S, may be associated with a real
number x.


DEFINITION OF RANDOM VARIABLE 
Given a random experiment with a sample space S, a function X that assigns to each element s in S one and
only one real number $X(s) = x$ is called a random variable. The space of X is the set of real numbers
$\{x : x = X(s),\ s \in S\}$, where $s \in S$ means that the element s belongs to the set S.
It may be that the set S has elements that are themselves real numbers. In such an instance we could write
$X(s) = s$, so that X is the identity function and the space of X is also S. This is illustrated in the example
below.


Example: 

Let the random experiment be the cast of a die, observing the number of spots on the side facing up. The sample
space associated with this experiment is $S = \{1,2,3,4,5,6\}$. For each $s \in S$, let $X(s) = s$. The space of
the random variable X is then $\{1,2,3,4,5,6\}$.

If we associate a probability of 1/6 with each outcome, then, for example,

$P(X = 5) = 1/6$, $P(2 \le X \le 5) = 4/6$, and $P(X \le 2) = 2/6$ seem to be reasonable assignments, where

$(2 \le X \le 5)$ means (X = 2, 3, 4 or 5) and $(X \le 2)$ means (X = 1 or 2), in this example.


We can recognize two major difficulties: 


1. In many practical situations the probabilities assigned to the events are unknown.
2. Since there are many ways of defining a function X on S, which function do we want to use? 


Let X denote a random variable with one-dimensional space R, a subset of the real numbers. Suppose that the
space R contains a countable number of points; that is, R contains either a finite number of points or the points of
R can be put into a one-to-one correspondence with the positive integers. Such a set R is called a set of discrete
points or simply a discrete sample space.


Furthermore, the random variable X is called a random variable of the discrete type, and X is said to have a
distribution of the discrete type. For a random variable X of the discrete type, the probability $P(X = x)$ is
frequently denoted by f(x) and is called the probability density function, abbreviated p.d.f.


Let f(x) be the p.d.f. of the random variable X of the discrete type, and let R be the space of X. Since

$f(x) = P(X = x)$, $x \in R$, f(x) must be positive for $x \in R$, and we want all these probabilities
to add to 1 because each $P(X = x)$ represents the fraction of times x can be expected to occur. Moreover, to
determine the probability associated with the event $A \subset R$, one would sum the probabilities of the x values in A.
That is, we want f(x) to satisfy the properties


• $f(x) = P(X = x) > 0$, $x \in R$;

• $\sum_{x \in R} f(x) = 1$;

• $P(X \in A) = \sum_{x \in A} f(x)$, where $A \subset R$.


We usually let $f(x) = 0$ when $x \notin R$, so that the domain of f(x) is the set of real numbers. When we define the
p.d.f. f(x) and do not say zero elsewhere, we tacitly mean that f(x) has been defined at all x's in the space R,
and it is assumed that $f(x) = 0$ elsewhere, namely, $f(x) = 0$, $x \notin R$. Since the probability

$P(X = x) = f(x) > 0$ when $x \in R$, and since R contains all the probabilities associated with X, R is sometimes
referred to as the support of X as well as the space of X.


Example: 

Roll a four-sided die twice and let X equal the larger of the two outcomes if they are different and the common
value if they are the same. The sample space for this experiment is $S = \{(d_1, d_2) : d_1 = 1,2,3,4;\ d_2 = 1,2,3,4\}$,
where each of these 16 points has probability 1/16. Then $P(X = 1) = P[(1,1)] = 1/16$,

$P(X = 2) = P[\{(1,2), (2,1), (2,2)\}] = 3/16$, and similarly $P(X = 3) = 5/16$ and $P(X = 4) = 7/16$. That
is, the p.d.f. of X can be written simply as $f(x) = P(X = x) = \frac{2x - 1}{16}$, $x = 1,2,3,4$.

We could add that $f(x) = 0$ elsewhere; but if we do not, one should take f(x) to equal zero when $x \notin R$.
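The p.d.f. in this example can be checked by brute-force enumeration of the 16 outcomes. The following is a minimal sketch; the names `outcomes` and `pmf` are illustrative choices, not part of the original text.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 16 equally likely outcomes (d1, d2) of two four-sided dice.
outcomes = list(product(range(1, 5), repeat=2))

# X is the larger of the two values (or the common value when they are equal).
pmf = {}
for d1, d2 in outcomes:
    x = max(d1, d2)
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, len(outcomes))

# Compare the enumerated probabilities with the closed form (2x - 1)/16.
for x in sorted(pmf):
    print(x, pmf[x], Fraction(2 * x - 1, 16))
```

Exact fractions are used so that the comparison with $(2x-1)/16$ is free of rounding error.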


A better understanding of a particular probability distribution can often be obtained with a graph that depicts the 
p.d.f. of X. 


Note: the graph of the p.d.f., when $f(x) > 0$, would be simply the set of points $\{[x, f(x)] : x \in R\}$, where R is
the space of X.


Two types of graphs can be used to give a better visual appreciation of the p.d.f., namely, a bar graph and a
probability histogram. A bar graph of the p.d.f. f(x) of the random variable X is a graph having a vertical line
segment drawn from $(x, 0)$ to $[x, f(x)]$ at each x in R, the space of X. If X can only assume integer values, a
probability histogram of the p.d.f. f(x) is a graphical representation that has a rectangle of height f(x) and a base
of length 1, centered at x, for each $x \in R$, the space of X.


CUMULATIVE DISTRIBUTION FUNCTION 
Let X be a random variable of the discrete type with space R and p.d.f. $f(x) = P(X = x)$, $x \in R$. Now
take x to be a real number and consider the set A of all points in R that are less than or equal to x. That is,
$A = \{t : t \le x \text{ and } t \in R\}$.
Let us define the function F(x) by
Equation:


$$F(x) = P(X \le x) = \sum_{t \in A} f(t).$$


The function F(x) is called the distribution function (sometimes cumulative distribution function) of the 
discrete-type random variable X. 


Several properties of a distribution function F(x) can be listed as a consequence of the fact that probability must be 
a value between 0 and 1, inclusive: 


• $0 \le F(x) \le 1$ because F(x) is a probability,

• F(x) is a nondecreasing function of x,

• $F(y) = 1$, where y is any value greater than or equal to the largest value in R; and $F(z) = 0$, where z is
any value less than the smallest value in R,


• If X is a random variable of the discrete type, then F(x) is a step function, and the height of a step at x, $x \in R$,
equals the probability $P(X = x)$.


Note: It is clear that the probability distribution associated with the random variable X can be described by either 
the distribution function F(x) or by the probability density function f(x). The function used is a matter of 
convenience; in most instances, f(x) is easier to use than F(x). 
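The step-function behaviour of F(x) can be illustrated with a short sketch. It reuses the p.d.f. $f(x) = (2x-1)/16$ of the four-sided-die example; the function name `F` and the evaluation points are illustrative assumptions.

```python
from fractions import Fraction

# p.d.f. of X = larger of two four-sided dice: f(x) = (2x - 1)/16 on R = {1, 2, 3, 4}.
f = {x: Fraction(2 * x - 1, 16) for x in range(1, 5)}

def F(x):
    """Distribution function F(x) = P(X <= x): sum f(t) over t in R with t <= x."""
    return sum((p for t, p in f.items() if t <= x), Fraction(0))

# F is 0 below the smallest point of R, 1 at and above the largest,
# and jumps by f(x) at each x in R.
for x in (0, 1, 2.5, 3, 4, 10):
    print(x, F(x))
```

The jump of F at a point of R, for instance `F(3) - F(2.5)`, equals `f[3]`, which is the step-height property listed above.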


Figure: Graphical representation of the relationship between the p.d.f. and the c.d.f. of a random variable X. The area under the p.d.f. curve up to a point a equals the value of the c.d.f. curve at a, $F(a) = P(X \le a)$.


MATHEMATICAL EXPECTATION
If f(x) is the p.d.f. of the random variable X of the discrete type with space R and if the summation


Equation:

$$\sum_{R} u(x) f(x) = \sum_{x \in R} u(x) f(x)$$

exists, then the sum is called the mathematical expectation or the expected value of the function u(X),
and it is denoted by $E[u(X)]$. That is,
Equation:

$$E[u(X)] = \sum_{R} u(x) f(x).$$

We can think of the expected value $E[u(X)]$ as a weighted mean of u(x), $x \in R$, where the weights are
the probabilities $f(x) = P(X = x)$.


Note: The usual definition of the mathematical expectation of u(X) requires that the sum converge
absolutely; that is, $\sum_{x \in R} |u(x)| f(x)$ exists.


There is another important observation that must be made about the consistency of this definition. Certainly,
this function u(X) of the random variable X is itself a random variable, say Y. Suppose that we find the
p.d.f. of Y to be g(y) on the support $R_1$. Then E(Y) is given by the summation $\sum_{y \in R_1} y\, g(y)$.


In general it is true that

$$\sum_{R} u(x) f(x) = \sum_{y \in R_1} y\, g(y).$$

That is, the same expectation is obtained by either method.


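Example:
Let X be the random variable defined by the outcome of the cast of a die. Thus the p.d.f. of X is
$f(x) = \frac{1}{6}$, $x = 1,2,3,4,5,6$.

In terms of the observed value x, the function u is as follows:

$$u(x) = \begin{cases} 1, & x = 1,2,3, \\ 5, & x = 4,5, \\ 35, & x = 6. \end{cases}$$

The mathematical expectation is equal to

$$\sum_{x=1}^{6} u(x) f(x) = 1\left(\tfrac16\right) + 1\left(\tfrac16\right) + 1\left(\tfrac16\right) + 5\left(\tfrac16\right) + 5\left(\tfrac16\right) + 35\left(\tfrac16\right) = 1\left(\tfrac36\right) + 5\left(\tfrac26\right) + 35\left(\tfrac16\right) = 8.$$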
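Example:


Let the random variable X have the p.d.f.
$f(x) = \frac{1}{3}$, $x \in R$,
where $R = \{-1, 0, 1\}$. Let $u(X) = X^2$. Then

$$\sum_{x \in R} x^2 f(x) = (-1)^2\left(\tfrac13\right) + (0)^2\left(\tfrac13\right) + (1)^2\left(\tfrac13\right) = \tfrac23.$$

However, the support of the random variable $Y = X^2$ is $R_1 = \{0, 1\}$ and

$$P(Y = 0) = P(X = 0) = \tfrac13,$$
$$P(Y = 1) = P(X = -1) + P(X = 1) = \tfrac13 + \tfrac13 = \tfrac23.$$

That is,
$$g(y) = \begin{cases} \tfrac13, & y = 0, \\ \tfrac23, & y = 1, \end{cases}$$
and $R_1 = \{0, 1\}$. Hence

$$\sum_{y \in R_1} y\, g(y) = 0\left(\tfrac13\right) + 1\left(\tfrac23\right) = \tfrac23,$$

which illustrates the preceding observation.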
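The agreement of the two ways of computing the expectation can be reproduced numerically. The sketch below assumes the example's p.d.f. $f(x) = 1/3$ on $\{-1, 0, 1\}$ and $u(x) = x^2$; the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# p.d.f. of X on R = {-1, 0, 1} and the function u(x) = x^2.
f = {x: Fraction(1, 3) for x in (-1, 0, 1)}

def u(x):
    return x ** 2

# Method 1: sum u(x) f(x) directly over the space R of X.
direct = sum(u(x) * p for x, p in f.items())

# Method 2: first build the p.d.f. g of Y = u(X), then sum y g(y) over R_1.
g = defaultdict(Fraction)
for x, p in f.items():
    g[u(x)] += p
via_g = sum(y * p for y, p in g.items())

print(direct, via_g, dict(g))
```

Both sums give 2/3, and `g` reproduces $g(0) = 1/3$, $g(1) = 2/3$.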


When it exists, mathematical expectation E satisfies the following properties: 


1. If c is a constant, $E(c) = c$,
2. If c is a constant and u is a function, $E[c\,u(X)] = c\,E[u(X)]$,


3. If $c_1$ and $c_2$ are constants and $u_1$ and $u_2$ are functions, then
$E[c_1 u_1(X) + c_2 u_2(X)] = c_1 E[u_1(X)] + c_2 E[u_2(X)]$.


First, we have for the proof of (1) that
$$E(c) = \sum_{R} c f(x) = c \sum_{R} f(x) = c,$$
because $\sum_{R} f(x) = 1$.


Next, to prove (2), we see that

$$E[c\,u(X)] = \sum_{R} c\,u(x) f(x) = c \sum_{R} u(x) f(x) = c\,E[u(X)].$$


Finally, the proof of (3) is given by

$$E[c_1 u_1(X) + c_2 u_2(X)] = \sum_{R} [c_1 u_1(x) + c_2 u_2(x)] f(x) = \sum_{R} c_1 u_1(x) f(x) + \sum_{R} c_2 u_2(x) f(x).$$


By applying (2), we obtain

$$E[c_1 u_1(X) + c_2 u_2(X)] = c_1 E[u_1(X)] + c_2 E[u_2(X)].$$

Property (3) can be extended to more than two terms by mathematical induction; that is, we have (3')
$$E\left[\sum_{i=1}^{k} c_i u_i(X)\right] = \sum_{i=1}^{k} c_i E[u_i(X)].$$


Because of property (3'), mathematical expectation E is called a linear or distributive operator.


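Example:
Let X have the p.d.f. $f(x) = \frac{x}{10}$, $x = 1, 2, 3, 4$. Then
$$E(X) = \sum_{x=1}^{4} x\left(\frac{x}{10}\right) = 1\left(\tfrac{1}{10}\right) + 2\left(\tfrac{2}{10}\right) + 3\left(\tfrac{3}{10}\right) + 4\left(\tfrac{4}{10}\right) = 3$$
and
$$E(X^2) = \sum_{x=1}^{4} x^2\left(\frac{x}{10}\right) = 1\left(\tfrac{1}{10}\right) + 4\left(\tfrac{2}{10}\right) + 9\left(\tfrac{3}{10}\right) + 16\left(\tfrac{4}{10}\right) = 10.$$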
The MEAN, VARIANCE, and STANDARD DEVIATION


MEAN and VARIANCE 


Certain mathematical expectations are so important that they have special names. In 
this section we consider two of them: the mean and the variance. 


Mean Value 


If X is a random variable with p.d.f. f(x) of the discrete type and space $R = \{b_1, b_2, b_3, \ldots\}$, then

$$E(X) = \sum_{R} x f(x) = b_1 f(b_1) + b_2 f(b_2) + b_3 f(b_3) + \cdots$$

is the weighted average of the numbers belonging to R, where the weights are given by
the p.d.f. f(x).


We call E(X) the mean of X (or the mean of the distribution) and denote it by $\mu$.
That is, $\mu = E(X)$.


Note: In mechanics, the weighted average of the points $b_1, b_2, b_3, \ldots$ in one-dimensional
space is called the centroid of the system. Those without the mechanics
background can think of the centroid as being the point of balance for the system in
which the weights $f(b_1), f(b_2), f(b_3), \ldots$ are placed upon the points $b_1, b_2, b_3, \ldots$.


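Example:
Let X have the p.d.f.
$$f(x) = \begin{cases} \tfrac18, & x = 0, 3, \\ \tfrac38, & x = 1, 2. \end{cases}$$
The mean of X is
$$\mu = E(X) = 0\left(\tfrac18\right) + 1\left(\tfrac38\right) + 2\left(\tfrac38\right) + 3\left(\tfrac18\right) = \tfrac32.$$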


The example below shows that if the outcomes of X are equally likely (i.e., each of 
the outcomes has the same probability), then the mean of X is the arithmetic average 
of these outcomes. 


Example: 
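Roll a fair die and let X denote the outcome. Thus X has the p.d.f.

$$f(x) = \frac{1}{6}, \quad x = 1, 2, 3, 4, 5, 6.$$

Then,

$$E(X) = \sum_{x=1}^{6} x\left(\frac16\right) = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac72,$$

which is the arithmetic average of the first six positive integers.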


Variance 


It was noted that the mean $\mu = E(X)$ is the centroid of a system of weights, or a
measure of the central location of the probability distribution of X. A measure of the
dispersion or spread of a distribution is defined as follows:


If $u(x) = (x - \mu)^2$ and $E[(X - \mu)^2]$ exists, the variance, frequently denoted by
$\sigma^2$ or Var(X), of a random variable X of the discrete type (or variance of the
distribution) is defined by
Equation:


$$\sigma^2 = E[(X - \mu)^2] = \sum_{R} (x - \mu)^2 f(x).$$

The positive square root of the variance is called the standard deviation of X and is
denoted by
Equation:


$$\sigma = \sqrt{\mathrm{Var}(X)} = \sqrt{E[(X - \mu)^2]}.$$


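Example:
Let the p.d.f. of X be defined by

$$f(x) = \frac{x}{6}, \quad x = 1, 2, 3.$$

The mean of X is

$$\mu = E(X) = 1\left(\tfrac16\right) + 2\left(\tfrac26\right) + 3\left(\tfrac36\right) = \tfrac73.$$

To find the variance and standard deviation of X we first find

$$E(X^2) = 1^2\left(\tfrac16\right) + 2^2\left(\tfrac26\right) + 3^2\left(\tfrac36\right) = \tfrac{36}{6} = 6.$$

Thus the variance of X is

$$\sigma^2 = E(X^2) - \mu^2 = 6 - \left(\tfrac73\right)^2 = \tfrac59,$$

and the standard deviation of X is

$$\sigma = \sqrt{5/9} = 0.745.$$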


Example:
Let X be a random variable with mean $\mu_X$ and variance $\sigma_X^2$. Of course, $Y = aX + b$,
where a and b are constants, is a random variable, too. The mean of Y is


$$\mu_Y = E(Y) = E(aX + b) = aE(X) + b = a\mu_X + b.$$


Moreover, the variance of Y is


$$\sigma_Y^2 = E[(Y - \mu_Y)^2] = E[(aX + b - a\mu_X - b)^2] = E[a^2 (X - \mu_X)^2] = a^2 \sigma_X^2.$$


Moments of the distribution 


Let r be a positive integer. If

$$E(X^r) = \sum_{R} x^r f(x)$$

exists, it is called the rth moment of the distribution about the origin. The
expression moment has its origin in the study of mechanics.


In addition, the expectation
$$E[(X - b)^r] = \sum_{R} (x - b)^r f(x)$$

is called the rth moment of the distribution about b. For a given positive integer r,
$$E[(X)_r] = E[X(X - 1)(X - 2) \cdots (X - r + 1)]$$

is called the rth factorial moment.


Note: The second factorial moment is equal to the difference of the second and first
moments:


$$E[X(X - 1)] = E(X^2) - E(X).$$


There is another formula that can be used for computing the variance that uses the 
second factorial moment and sometimes simplifies the calculations. 


First find the values of E(X) and $E[X(X - 1)]$. Then
$$\sigma^2 = E[X(X - 1)] + E(X) - [E(X)]^2,$$
since, using the distributive property of E, this becomes

$$\sigma^2 = E(X^2) - E(X) + E(X) - [E(X)]^2 = E(X^2) - \mu^2.$$


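Example:
Continuing with the earlier example in which $f(x) = x/6$, $x = 1, 2, 3$, it can be found that

$$E[X(X - 1)] = 1(0)\left(\tfrac16\right) + 2(1)\left(\tfrac26\right) + 3(2)\left(\tfrac36\right) = \tfrac{22}{6}.$$
Thus
$$\sigma^2 = E[X(X - 1)] + E(X) - [E(X)]^2 = \tfrac{22}{6} + \tfrac73 - \left(\tfrac73\right)^2 = \tfrac59.$$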
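The three routes to the variance (the definition, $E(X^2) - \mu^2$, and the second factorial moment) can be compared directly. This is a minimal sketch for the p.d.f. $f(x) = x/6$, $x = 1, 2, 3$; the helper `expect` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

# p.d.f. f(x) = x/6 on R = {1, 2, 3}.
f = {x: Fraction(x, 6) for x in (1, 2, 3)}

def expect(u):
    """E[u(X)]: sum of u(x) f(x) over the space R."""
    return sum(u(x) * p for x, p in f.items())

mu = expect(lambda x: x)
var_def = expect(lambda x: (x - mu) ** 2)                  # E[(X - mu)^2]
var_short = expect(lambda x: x * x) - mu ** 2              # E(X^2) - mu^2
var_fact = expect(lambda x: x * (x - 1)) + mu - mu ** 2    # factorial-moment form

print(mu, var_def, var_short, var_fact, float(var_def) ** 0.5)
```

All three variance expressions evaluate to 5/9, and the square root is about 0.745.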


Note: Recall that the empirical distribution is defined by placing the weight (probability)
of 1/n on each of n observations $x_1, x_2, \ldots, x_n$. Then the mean of this empirical
distribution is

$$\sum_{i=1}^{n} x_i \frac{1}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.$$

The symbol $\bar{x}$ represents the mean of the empirical distribution. It is seen that $\bar{x}$ is
usually close in value to $\mu = E(X)$; thus, when $\mu$ is unknown, $\bar{x}$ will be used to
estimate $\mu$.


Similarly, the variance of the empirical distribution can be computed. Let v denote
this variance so that it is equal to

$$v = \sum_{i=1}^{n} (x_i - \bar{x})^2 \frac{1}{n} = \sum_{i=1}^{n} x_i^2 \frac{1}{n} - \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$$

This last statement is true because, in general,

$$\sigma^2 = E(X^2) - \mu^2.$$


Note: There is a relationship between the sample variance $s^2$ and the variance v of the
empirical distribution, namely $s^2 = nv/(n - 1)$. Of course, with large n, the
difference between $s^2$ and v is very small. Usually, we use $s^2$ to estimate $\sigma^2$ when $\sigma^2$
is unknown.
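The relationship between v and $s^2$ can be checked on a small data set. The observations below are hypothetical values chosen only for illustration; `pvariance` and `variance` are the population and sample variance functions of Python's standard `statistics` module.

```python
from statistics import mean, pvariance, variance

# Hypothetical observations x_1, ..., x_n (not taken from the text).
data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]
n = len(data)

xbar = mean(data)                                    # mean of the empirical distribution
v = sum((x - xbar) ** 2 for x in data) / n           # variance of the empirical distribution
v_alt = sum(x * x for x in data) / n - xbar ** 2     # (1/n) * sum of x_i^2, minus xbar^2
s2 = n * v / (n - 1)                                 # sample variance via s^2 = n v / (n - 1)

print(xbar, v, v_alt, s2)
print(pvariance(data), variance(data))               # library values for comparison
```

For these data v = 4 and $s^2$ = 4.8, matching the library's population and sample variances.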


BERNOULLI TRIALS and the BINOMIAL DISTRIBUTION


A Bernoulli experiment is a random experiment, the outcome of which can be classified in but one of two
mutually exclusive and exhaustive ways, namely, success or failure (e.g., female or male, life or death,
nondefective or defective).


A sequence of Bernoulli trials occurs when a Bernoulli experiment is performed several independent times so
that the probability of success, say p, remains the same from trial to trial. That is, in such a sequence we let p
denote the probability of success on each trial. In addition, we frequently let $q = 1 - p$ denote the probability of
failure; that is, we shall use q and $1 - p$ interchangeably.


Bernoulli distribution 
Let X be a random variable associated with a Bernoulli trial by defining it as follows:
X(success) = 1 and X(failure) = 0.


That is, the two outcomes, success and failure, are denoted by one and zero, respectively. The p.d.f. of X can be
written as
Equation:


$$f(x) = p^x (1 - p)^{1 - x}, \quad x = 0, 1,$$


and we say that X has a Bernoulli distribution. The expected value of X is
Equation:


$$\mu = E(X) = \sum_{x=0}^{1} x\, p^x (1 - p)^{1 - x} = (0)(1 - p) + (1)(p) = p,$$


and the variance of X is
Equation:


$$\sigma^2 = \mathrm{Var}(X) = \sum_{x=0}^{1} (x - p)^2 p^x (1 - p)^{1 - x} = p^2 (1 - p) + (1 - p)^2 p = p(1 - p) = pq.$$


It follows that the standard deviation of X is $\sigma = \sqrt{p(1 - p)} = \sqrt{pq}$.


In a sequence of n Bernoulli trials, we shall let $X_i$ denote the Bernoulli random variable associated with the ith
trial. An observed sequence of n Bernoulli trials will then be an n-tuple of zeros and ones.


Binomial Distribution 


In a sequence of Bernoulli trials we are often interested in the total number of successes and not in the order of
their occurrence. If we let the random variable X equal the number of observed successes in n Bernoulli trials,
the possible values of X are 0, 1, 2, ..., n. If x successes occur, where x = 0, 1, 2, ..., n, then n - x failures occur. The
number of ways of selecting x positions for the x successes in the n trials is

$$\binom{n}{x} = \frac{n!}{x!\,(n - x)!}.$$

Since the trials are independent and since the probabilities of success and failure on each trial are, respectively,
p and $q = 1 - p$, the probability of each of these ways is $p^x (1 - p)^{n - x}$. Thus the p.d.f. of X, say f(x), is the
sum of the probabilities of these $\binom{n}{x}$ mutually exclusive events; that is,

$$f(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x = 0, 1, 2, \ldots, n.$$

These probabilities are called binomial probabilities, and the random variable X is said to have a binomial
distribution.
Summarizing, a binomial experiment satisfies the following properties:


1. A Bernoulli (success-failure) experiment is performed n times.

2. The trials are independent.

3. The probability of success on each trial is a constant p; the probability of failure is $q = 1 - p$.
4. The random variable X counts the number of successes in the n trials.


A binomial distribution will be denoted by the symbol b(n, p), and we say that the distribution of X is b(n, p).
The constants n and p are called the parameters of the binomial distribution; they correspond to the number
n of independent trials and the probability p of success on each trial. Thus, if we say that the distribution of X is
b(12, 1/4), we mean that X is the number of successes in n = 12 Bernoulli trials with probability p = 1/4 of
success on each trial.


Example:
In an instant lottery with 20% winning tickets, if X is equal to the number of winning tickets among n = 8 that
are purchased, the probability of purchasing 2 winning tickets is


$$f(2) = P(X = 2) = \binom{8}{2} (0.2)^2 (0.8)^6 = 0.2936.$$


The distribution of the random variable X is b(8, 0.2).
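The binomial probability in this example can be evaluated directly from the p.d.f. The function name `binom_pmf` below is an illustrative choice; `math.comb` supplies the binomial coefficient.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial p.d.f.: C(n, x) * p^x * (1 - p)^(n - x) for x = 0, 1, ..., n."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Lottery example: n = 8 tickets, probability 0.2 that a ticket wins.
print(round(binom_pmf(2, 8, 0.2), 4))

# The probabilities over x = 0, ..., n add to 1 (up to floating-point rounding).
print(sum(binom_pmf(x, 8, 0.2) for x in range(9)))
```

The first value printed agrees with the 0.2936 obtained above.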


Example:

Leghorn chickens are raised for laying eggs. If p = 0.5 is the probability of a female chick hatching, then, assuming
independence, the probability that there are exactly 6 females out of 10 newly hatched chicks selected at
random is

$$\binom{10}{6} \left(\tfrac12\right)^6 \left(\tfrac12\right)^4 = P(X \le 6) - P(X \le 5) = 0.8281 - 0.6230 = 0.2051,$$

since

$$P(X \le 6) = 0.8281$$

and

$$P(X \le 5) = 0.6230,$$

which are tabulated values. The probability of at least 6 female chicks is

$$\sum_{x=6}^{10} \binom{10}{x} \left(\tfrac12\right)^x \left(\tfrac12\right)^{10 - x} = 1 - P(X \le 5) = 1 - 0.6230 = 0.3770.$$


Example:

Suppose that we are in those rare times when 65% of the American public approve of the way the President of
the United States is handling his job. Take a random sample of n = 8 Americans and let Y equal the number
who give approval. Then the distribution of Y is b(8, 0.65). To find $P(Y \ge 6)$,
note that
$$P(Y \ge 6) = P(8 - Y \le 8 - 6) = P(X \le 2),$$
where
$$X = 8 - Y$$
counts the number who disapprove. Since $q = 1 - p = 0.35$ equals the probability of disapproval by each
person selected, the distribution of X is b(8, 0.35). From the tables, since

$$P(X \le 2) = 0.4278,$$
it follows that
$$P(Y \ge 6) = 0.4278.$$
Similarly,
$$P(Y \le 5) = P(8 - Y \ge 8 - 5) = P(X \ge 3) = 1 - P(X \le 2) = 1 - 0.4278 = 0.5722$$
and

$$P(Y = 5) = P(8 - Y = 8 - 5) = P(X = 3) = P(X \le 3) - P(X \le 2) = 0.7064 - 0.4278 = 0.2786.$$


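Note: if n is a positive integer, then

$$(a + b)^n = \sum_{x=0}^{n} \binom{n}{x} b^x a^{n - x}.$$

Thus the sum of the binomial probabilities, if we use the above binomial expansion with b = p and a = 1 - p,
is

$$\sum_{x=0}^{n} \binom{n}{x} p^x (1 - p)^{n - x} = [(1 - p) + p]^n = 1,$$

a result that had to follow from the fact that f(x) is a p.d.f. We use the binomial expansion to find the mean
and the variance of the binomial random variable X that is b(n, p). The mean is given by
Equation:


$$\mu = E(X) = \sum_{x=0}^{n} x \frac{n!}{x!\,(n - x)!} p^x (1 - p)^{n - x}.$$


Since the first term of this sum is equal to zero, this can be written as
Equation:


$$\mu = \sum_{x=1}^{n} \frac{n!}{(x - 1)!\,(n - x)!} p^x (1 - p)^{n - x},$$


because $x/x! = 1/(x - 1)!$ when x > 0. Letting k = x - 1, we obtain

$$\mu = np \sum_{k=0}^{n-1} \frac{(n - 1)!}{k!\,(n - 1 - k)!} p^k (1 - p)^{n - 1 - k} = np\,[(1 - p) + p]^{n - 1} = np.$$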


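To find the variance, we first determine the second factorial moment $E[X(X - 1)]$:
Equation:


$$E[X(X - 1)] = \sum_{x=0}^{n} x(x - 1) \frac{n!}{x!\,(n - x)!} p^x (1 - p)^{n - x}.$$


The first two terms in this summation equal zero; thus we find that
$$E[X(X - 1)] = \sum_{x=2}^{n} \frac{n!}{(x - 2)!\,(n - x)!} p^x (1 - p)^{n - x},$$

after observing that $x(x - 1)/x! = 1/(x - 2)!$ when x > 1. Letting k = x - 2, we obtain


$$E[X(X - 1)] = \sum_{k=0}^{n-2} \frac{n!}{k!\,(n - k - 2)!} p^{k+2} (1 - p)^{n - k - 2} = n(n - 1)p^2 \sum_{k=0}^{n-2} \frac{(n - 2)!}{k!\,(n - 2 - k)!} p^k (1 - p)^{n - 2 - k}.$$


Since the last summand is that of the binomial p.d.f. b(n - 2, p), the sum equals 1 and we obtain
$$E[X(X - 1)] = n(n - 1)p^2.$$
Thus,
$$\sigma^2 = \mathrm{Var}(X) = E(X^2) - [E(X)]^2 = E[X(X - 1)] + E(X) - [E(X)]^2$$
$$= n(n - 1)p^2 + np - (np)^2 = -np^2 + np = np(1 - p).$$


Summarizing,


if X is b(n, p), we obtain


$$\mu = np, \quad \sigma^2 = np(1 - p) = npq, \quad \sigma = \sqrt{np(1 - p)}.$$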


Note: When p is the probability of success on each trial, the expected number of successes in n trials is np, a 
result that agrees with most of our intuitions. 
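The formulas $\mu = np$ and $\sigma^2 = np(1 - p)$ can be confirmed by summing over the p.d.f. The sketch below uses the b(12, 1/4) distribution mentioned earlier, with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 12, Fraction(1, 4)

# Binomial p.d.f. f(x) = C(n, x) p^x (1 - p)^(n - x), x = 0, ..., n, in exact arithmetic.
f = {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}

mean = sum(x * fx for x, fx in f.items())              # E(X)
fact2 = sum(x * (x - 1) * fx for x, fx in f.items())   # E[X(X - 1)]
var = fact2 + mean - mean ** 2                         # factorial-moment formula

print(mean, n * p)
print(fact2, n * (n - 1) * p ** 2)
print(var, n * p * (1 - p))
```

Each printed pair agrees: the summed values equal np = 3, n(n - 1)p² = 33/4, and np(1 - p) = 9/4.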


GEOMETRIC DISTRIBUTION


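To obtain a binomial random variable, we observed a sequence of n Bernoulli trials and
counted the number of successes. Suppose now that we do not fix the number of Bernoulli
trials in advance but instead continue to observe the sequence of Bernoulli trials until a certain
number r of successes occurs. The random variable of interest is the number of trials
needed to observe the rth success.


Let us first discuss the problem when r = 1. That is, consider a sequence of Bernoulli trials with
probability p of success. This sequence is observed until the first success occurs. Let X denote
the trial number on which the first success occurs.

For example, if F and S represent failure and success, respectively, and the sequence starts
with F,F,F,S,..., then X = 4. Moreover, because the trials are independent, the probability of
such a sequence is

$$P(X = 4) = (q)(q)(q)(p) = q^3 p = (1 - p)^3 p.$$

In general, the p.d.f. $f(x) = P(X = x)$ of X is given by $f(x) = (1 - p)^{x - 1} p$, $x = 1, 2, \ldots$,
because there must be x - 1 failures before the first success that occurs on trial x. We say that X
has a geometric distribution.


Note: for a geometric series, the sum is given by

$$\sum_{k=0}^{\infty} a r^k = \sum_{k=1}^{\infty} a r^{k - 1} = \frac{a}{1 - r}$$

when $|r| < 1$.


Thus,

$$\sum_{x=1}^{\infty} f(x) = \sum_{x=1}^{\infty} (1 - p)^{x - 1} p = \frac{p}{1 - (1 - p)} = 1,$$

so that f(x) does satisfy the properties of a p.d.f.


From the sum of a geometric series we also note that, when k is an integer,

$$P(X > k) = \sum_{x=k+1}^{\infty} (1 - p)^{x - 1} p = \frac{(1 - p)^k p}{1 - (1 - p)} = (1 - p)^k = q^k,$$

and thus the value of the distribution function at a positive integer k is

$$P(X \le k) = \sum_{x=1}^{k} (1 - p)^{x - 1} p = 1 - P(X > k) = 1 - (1 - p)^k = 1 - q^k.$$


Example:

Some biology students were checking the eye color of a large number of fruit flies. For the
individual fly, suppose that the probability of white eyes is p and the probability of red eyes
is q = 1 - p, and that we may treat these flies as independent Bernoulli trials. The probability that at
least four flies have to be checked for eye color to observe a white-eyed fly is given by

$$P(X \ge 4) = P(X > 3) = q^3.$$

The probability that at most four flies have to be checked for eye color to observe a white-eyed
fly is given by

$$P(X \le 4) = 1 - q^4.$$

The probability that the first fly with white eyes is the fourth fly that is checked is

$$P(X = 4) = q^{4 - 1} p = q^3 p.$$

It is also true that

$$P(X = 4) = P(X \le 4) - P(X \le 3) = (1 - q^4) - (1 - q^3) = q^3 (1 - q) = q^3 p.$$

In general,

$$f(x) = P(X = x) = q^{x - 1} p, \quad x = 1, 2, 3, \ldots$$


To find the mean and variance of the geometric distribution, let us use the following results about
the sum and the first and second derivatives of a geometric series. For $-1 < r < 1$, let

$$g(r) = \sum_{k=0}^{\infty} a r^k = \frac{a}{1 - r}.$$

Then

$$g'(r) = \sum_{k=1}^{\infty} a k r^{k - 1} = \frac{a}{(1 - r)^2},$$

and

$$g''(r) = \sum_{k=2}^{\infty} a k (k - 1) r^{k - 2} = \frac{2a}{(1 - r)^3}.$$

If X has a geometric distribution and $0 < p < 1$, then the mean of X is given by
Equation:

$$E(X) = \sum_{x=1}^{\infty} x q^{x - 1} p = \frac{p}{(1 - q)^2} = \frac{1}{p},$$

using the formula for $g'(r)$ with a = p and r = q.


Note, for example, that if p = 1/4 is the probability of success, then

$$E(X) = \frac{1}{1/4} = 4$$

trials are needed on the average to observe a success.


To find the variance of X, let us first find the second factorial moment $E[X(X - 1)]$. We have

$$E[X(X - 1)] = \sum_{x=1}^{\infty} x(x - 1) q^{x - 1} p = \sum_{x=2}^{\infty} pq\, x(x - 1) q^{x - 2} = \frac{2pq}{(1 - q)^3} = \frac{2q}{p^2},$$

using the formula for $g''(r)$ with a = pq and r = q. Thus the variance of X is

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \{E[X(X - 1)] + E(X)\} - [E(X)]^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{q}{p^2} = \frac{1 - p}{p^2}.$$

The standard deviation of X is

$$\sigma = \sqrt{(1 - p)/p^2}.$$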
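The geometric p.d.f., distribution function, mean, and variance can be checked numerically. The sketch below assumes the illustrative value p = 1/4 from the note on the mean; the function names and the truncation point `N` are assumptions, chosen so that the neglected tail of the infinite sums is negligible.

```python
p = 0.25      # illustrative success probability (the p = 1/4 of the note on the mean)
q = 1 - p

def geom_pmf(x):
    """P(X = x) = q^(x - 1) * p for x = 1, 2, ..."""
    return q ** (x - 1) * p

def geom_cdf(k):
    """P(X <= k) = 1 - q^k for a positive integer k."""
    return 1 - q ** k

# Truncate the infinite sums at N terms; q^N is far below floating-point precision here.
N = 2000
xs = range(1, N + 1)
total = sum(geom_pmf(x) for x in xs)
mean = sum(x * geom_pmf(x) for x in xs)
var = sum(x * x * geom_pmf(x) for x in xs) - mean ** 2

print(total, mean, 1 / p)        # probabilities add to 1; mean is 1/p
print(var, q / p ** 2)           # variance is q/p^2
print(geom_cdf(4), sum(geom_pmf(x) for x in range(1, 5)))
```

With p = 1/4 the truncated sums give a mean of about 4 and a variance of about 12, in line with 1/p and q/p².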


POISSON DISTRIBUTION


Some experiments result in counting the number of times particular events occur in given times or on
given physical objects. For example, we would count the number of phone calls arriving at a switchboard
between 9 and 10 am, the number of flaws in 100 feet of wire, the number of customers that arrive
at a ticket window between 12 noon and 2 pm, or the number of defects in a 100-foot roll of aluminum 
screen that is 2 feet wide. Each count can be looked upon as a random variable associated with an 
approximate Poisson process provided the conditions in the definition below are satisfied. 


POISSON PROCESS
Let the number of changes that occur in a given continuous interval be counted. We have an
approximate Poisson process with parameter $\lambda > 0$ if the following are satisfied:


1. The numbers of changes occurring in nonoverlapping intervals are independent.

2. The probability of exactly one change in a sufficiently short interval of length h is approximately
$\lambda h$.

3. The probability of two or more changes in a sufficiently short interval is essentially zero. 


Suppose that an experiment satisfies the three points of an approximate Poisson process. Let X denote
the number of changes in an interval of "length 1" (where "length 1" represents one unit of the quantity
under consideration). We would like to find an approximation for $P(X = x)$, where x is a nonnegative
integer. To achieve this, we partition the unit interval into n subintervals of equal length 1/n. If n is
sufficiently large (i.e., much larger than x), one shall approximate the probability that x changes occur in
this unit interval by finding the probability that exactly one change occurs in each of exactly x of these n
subintervals. The probability of one change occurring in any one subinterval of length 1/n is
approximately $\lambda(1/n)$ by condition (2). The probability of two or more changes in any one subinterval
is essentially zero by condition (3). So for each subinterval, exactly one change occurs with a probability
of approximately $\lambda(1/n)$. Consider the occurrence or nonoccurrence of a change in each subinterval as
a Bernoulli trial. By condition (1) we have a sequence of n Bernoulli trials with probability p
approximately equal to $\lambda(1/n)$. Thus an approximation for $P(X = x)$ is given by the binomial
probability

$$\frac{n!}{x!\,(n - x)!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n - x}.$$

In order to obtain a better approximation, choose a large value for n. If n increases without bound, we
have that

$$\lim_{n \to \infty} \frac{n!}{x!\,(n - x)!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n - x} = \lim_{n \to \infty} \frac{n(n - 1) \cdots (n - x + 1)}{n^x} \frac{\lambda^x}{x!} \left(1 - \frac{\lambda}{n}\right)^n \left(1 - \frac{\lambda}{n}\right)^{-x}.$$

Now, for fixed x, we have

$$\lim_{n \to \infty} \frac{n(n - 1) \cdots (n - x + 1)}{n^x} = \lim_{n \to \infty} \left[1 \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{x - 1}{n}\right)\right] = 1,$$

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^n = e^{-\lambda},$$

and

$$\lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-x} = 1.$$

Thus,

$$\lim_{n \to \infty} \frac{n!}{x!\,(n - x)!} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n - x} = \frac{\lambda^x e^{-\lambda}}{x!} = P(X = x),$$

approximately. The distribution of probability associated with this process has a special name.


POISSON DISTRIBUTION 
We say that the random variable X has a Poisson distribution if its p.d.f. is of the form


$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots,$$


where $\lambda > 0$.
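The limiting argument that leads to this p.d.f. can be observed numerically: binomial probabilities with p = λ/n approach the Poisson value as n grows. The values λ = 2 and x = 3 below are illustrative assumptions, not taken from the text.

```python
from math import comb, exp, factorial

lam, x = 2.0, 3   # illustrative parameter and count

# Poisson probability lambda^x e^(-lambda) / x!.
poisson = lam ** x * exp(-lam) / factorial(x)

# Binomial probabilities b(n, lambda/n) evaluated at x for increasing n.
errors = []
for n in (10, 100, 1000, 10000):
    b = comb(n, x) * (lam / n) ** x * (1 - lam / n) ** (n - x)
    errors.append(abs(b - poisson))
    print(n, b, errors[-1])

print(poisson)
```

The absolute difference shrinks roughly in proportion to 1/n for these values.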


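It is easy to see that f(x) enjoys the properties of a p.d.f. because clearly $f(x) \ge 0$ and, from the
Maclaurin series expansion of $e^{\lambda}$, we have


$$\sum_{x=0}^{\infty} \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1.$$


To discover the exact role of the parameter $\lambda > 0$, let us find some of the characteristics of the Poisson
distribution. The mean for the Poisson distribution is given by


$$E(X) = \sum_{x=0}^{\infty} x \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x=1}^{\infty} \frac{\lambda^x}{(x - 1)!},$$

because $(0) f(0) = 0$ and $x/x! = 1/(x - 1)!$ when x > 0.


If we let k = x - 1, then

$$E(X) = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k+1}}{k!} = \lambda e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$


That is, the parameter $\lambda$ is the mean of the Poisson distribution. Figure 1 shows the p.d.f.
and c.d.f. of the Poisson distribution for $\lambda = 1$, $\lambda = 4$, $\lambda = 10$.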
Figure 1: The p.d.f. and c.d.f. of the Poisson distribution for $\lambda = 1$, $\lambda = 4$, $\lambda = 10$.


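To find the variance, we first determine the second factorial moment $E[X(X - 1)]$. We have


$$E[X(X - 1)] = \sum_{x=0}^{\infty} x(x - 1) \frac{\lambda^x e^{-\lambda}}{x!} = e^{-\lambda} \sum_{x=2}^{\infty} \frac{\lambda^x}{(x - 2)!},$$


because $(0)(0 - 1) f(0) = 0$, $(1)(1 - 1) f(1) = 0$, and $x(x - 1)/x! = 1/(x - 2)!$ when x > 1.


If we let k = x - 2, then


$$E[X(X - 1)] = e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^{k+2}}{k!} = \lambda^2 e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda^2 e^{-\lambda} e^{\lambda} = \lambda^2.$$


Thus,
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = E[X(X - 1)] + E(X) - [E(X)]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$


That is, for the Poisson distribution, $\mu = \sigma^2 = \lambda$.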


Example: 
Let X have a Poisson distribution with a mean of $\lambda = 5$ (it is possible to use the tabulated Poisson
distribution). Then


$$P(X \le 6) = \sum_{x=0}^{6} \frac{5^x e^{-5}}{x!} = 0.762,$$
$$P(X > 5) = 1 - P(X \le 5) = 1 - 0.616 = 0.384,$$


and


$$P(X = 6) = P(X \le 6) - P(X \le 5) = 0.762 - 0.616 = 0.146.$$


Example: 
Telephone calls enter a college switchboard on the average of two every 3 minutes. If one assumes an
approximate Poisson process, what is the probability of five or more calls arriving in a 9-minute period?
Let X denote the number of calls in a 9-minute period. We see that $E(X) = 6$; that is, on the average,
six calls will arrive during a 9-minute period. Thus, using tabulated data,


$$P(X \ge 5) = 1 - P(X \le 4) = 1 - \sum_{x=0}^{4} \frac{6^x e^{-6}}{x!} = 1 - 0.285 = 0.715.$$


Note: Not only is the Poisson distribution important in its own right, but it can also be used to 
approximate probabilities for a binomial distribution. 


If X has a Poisson distribution with parameter $\lambda$, we saw that with n large,


$$P(X = x) \approx \binom{n}{x} \left(\frac{\lambda}{n}\right)^x \left(1 - \frac{\lambda}{n}\right)^{n - x},$$


where $p = \lambda/n$, so that $\lambda = np$ in the above binomial probability. That is, if X has the binomial
distribution b(n, p) with large n, then


$$\frac{(np)^x e^{-np}}{x!} \approx \binom{n}{x} p^x (1 - p)^{n - x}.$$


This approximation is reasonably good if n is large. But since $\lambda$ was a fixed constant in that earlier
argument, p should be small since $np = \lambda$. In particular, the approximation is quite accurate if $n \ge 20$
and $p \le 0.05$, and it is very good if $n \ge 100$ and $np \le 10$.


Example:

A manufacturer of Christmas tree bulbs knows that 2% of its bulbs are defective. Approximate the
probability that a box of 100 of these bulbs contains at most three defective bulbs. Assuming
independence, we have a binomial distribution with parameters p = 0.02 and n = 100. The Poisson
distribution with $\lambda = 100(0.02) = 2$ gives


$$\sum_{x=0}^{3}\frac{2^{x}e^{-2}}{x!} = 0.857.$$


Using the binomial distribution, we obtain, after some tedious calculations,


$$\sum_{x=0}^{3}\binom{100}{x}(0.02)^{x}(0.98)^{100-x} = 0.859.$$


Hence, in this case, the Poisson approximation is extremely close to the true value, but much easier to
find.
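The "tedious calculations" are easy to delegate to a computer. The sketch below, with illustrative function names, evaluates both sums for the bulb example so the exact binomial value and its Poisson approximation can be compared side by side.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X distributed as b(n, p)."""
    return sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(k + 1))

def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson random variable with mean lam."""
    return sum(lam ** x * math.exp(-lam) / math.factorial(x) for x in range(k + 1))

n, p = 100, 0.02
exact = binom_cdf(3, n, p)         # binomial probability of at most three defectives
approx = poisson_cdf(3, n * p)     # Poisson approximation with lambda = np = 2
print(round(exact, 3), round(approx, 3))
```

Changing `n` and `p` while keeping `n * p` fixed gives a quick feel for how the quality of the approximation depends on p being small.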


CONTINUOUS DISTRIBUTION 


RANDOM VARIABLES OF THE CONTINUOUS TYPE 


Random variables whose spaces are not composed of a countable number of points but are
intervals or a union of intervals are said to be of the continuous type. Recall that the
relative frequency histogram $h(x)$ associated with n observations of a random variable of
that type is a nonnegative function defined so that the total area between its graph and the x
axis equals one. In addition, $h(x)$ is constructed so that the integral


Equation:
$$\int_{a}^{b} h(x)\,dx$$


is an estimate of the probability $P(a < X < b)$, where the interval (a, b) is a subset of the
space R of the random variable X.


Let us now consider what happens to the function $h(x)$ in the limit, as n increases without
bound and as the lengths of the class intervals decrease to zero. It is to be hoped that $h(x)$
will become closer and closer to some function, say $f(x)$, that gives the true probabilities,
such as $P(a < X < b)$, through the integral

Equation:
$$P(a < X < b) = \int_{a}^{b} f(x)\,dx.$$

PROBABILITY DENSITY FUNCTION
The function $f(x)$ is a nonnegative function such that the total area between its graph and
the x axis equals one.
The probability $P(a < X < b)$ is the area bounded by the graph of $f(x)$, the x axis,
and the lines x = a and x = b.
We say that the probability density function (p.d.f.) of the random variable X of the
continuous type, with space R that is an interval or union of intervals, is an integrable
function $f(x)$ satisfying the following conditions:


• $f(x) > 0$, x belongs to R,


• $\int_{R} f(x)\,dx = 1$,


• The probability of the event $X \in A$ is $P(X \in A) = \int_{A} f(x)\,dx$.


Example:
Let the random variable X be the distance in feet between bad records on a used computer
tape. Suppose that a reasonable probability model for X is given by the p.d.f.


$$f(x) = \frac{1}{40}e^{-x/40}, \quad 0 \le x < \infty.$$


Note: $R = \{x : 0 \le x < \infty\}$ and $f(x) > 0$ for x belonging to R, and


$$\int_{R} f(x)\,dx = \int_{0}^{\infty}\frac{1}{40}e^{-x/40}\,dx = \lim_{b\to\infty}\left[-e^{-x/40}\right]_{0}^{b} = 1 - \lim_{b\to\infty}e^{-b/40} = 1.$$


The probability that the distance between bad records is greater than 40 feet is


$$P(X > 40) = \int_{40}^{\infty}\frac{1}{40}e^{-x/40}\,dx = e^{-1} = 0.368.$$


The p.d.f. and the probability of interest are depicted in FIG.1.


FIG.1: The p.d.f. f(x) and the probability of interest.


To avoid repeated references to the space R of the random variable X, we shall adopt
the same convention when describing probability density functions of the continuous type as
was used in the discrete case.


Let us extend the definition of the p.d.f. $f(x)$ to the entire set of real numbers by letting it
equal zero when x does not belong to R. For example,


$$f(x) = \begin{cases} \dfrac{1}{40}e^{-x/40}, & 0 \le x < \infty, \\ 0, & \text{elsewhere}, \end{cases}$$


has the properties of a p.d.f. of a continuous-type random variable X having support
$\{x : 0 \le x < \infty\}$. It will always be understood that $f(x) = 0$ when x does not belong to R, even
when this is not explicitly written out.


DISTRIBUTION FUNCTION
The distribution function of a random variable X of the continuous type is defined in
terms of the p.d.f. of X, and is given by


$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$


From the fundamental theorem of calculus we have, for x values for which the derivative
$F'(x)$ exists, that $F'(x) = f(x)$.


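Example:
Continuing with Example 1: if the p.d.f. of X is


$$f(x) = \begin{cases} 0, & -\infty < x < 0, \\ \dfrac{1}{40}e^{-x/40}, & 0 \le x < \infty, \end{cases}$$


then the distribution function of X is $F(x) = 0$ for $x \le 0$, and for $x > 0$,


$$F(x) = \int_{-\infty}^{x} f(t)\,dt = \int_{0}^{x}\frac{1}{40}e^{-t/40}\,dt = \left[-e^{-t/40}\right]_{0}^{x} = 1 - e^{-x/40}.$$


Note:


$$F'(x) = \begin{cases} 0, & -\infty < x < 0, \\ \dfrac{1}{40}e^{-x/40}, & 0 < x < \infty. \end{cases}$$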

Also $F'(0)$ does not exist. Since there are no steps or jumps in a distribution function $F(x)$
of the continuous type, it must be true that


$$P(X = b) = 0$$


for all real values of b. This agrees with the fact that the integral


$$\int_{b}^{b} f(x)\,dx$$


is taken to be zero in calculus. Thus we see that


$$P(a \le X \le b) = P(a < X < b) = P(a \le X < b) = P(a < X \le b) = F(b) - F(a),$$


provided that X is a random variable of the continuous type. Moreover, we can change the
definition of a p.d.f. of a random variable of the continuous type at a finite (actually
countable) number of points without altering the distribution of probability.


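For illustration,


$$f(x) = \begin{cases} 0, & -\infty < x < 0, \\ \dfrac{1}{40}e^{-x/40}, & 0 \le x < \infty, \end{cases}$$


and


$$f(x) = \begin{cases} 0, & -\infty < x \le 0, \\ \dfrac{1}{40}e^{-x/40}, & 0 < x < \infty, \end{cases}$$


are equivalent in the computation of probabilities involving this random variable.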


Example:
Let Y be a continuous random variable with the p.d.f. $g(y) = 2y$, $0 < y < 1$. The
distribution function of Y is defined by


$$G(y) = \begin{cases} 0, & y < 0, \\ \displaystyle\int_{0}^{y} 2t\,dt = y^{2}, & 0 \le y < 1, \\ 1, & 1 \le y. \end{cases}$$


Figure 2 gives the graph of the p.d.f. $g(y)$ and the graph of the distribution function $G(y)$.


Figure 2: The p.d.f. and the probability of interest.


For illustration of computations of probabilities, consider 


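$$P\left(\tfrac{1}{2} < Y \le \tfrac{3}{4}\right) = G\left(\tfrac{3}{4}\right) - G\left(\tfrac{1}{2}\right) = \left(\tfrac{3}{4}\right)^{2} - \left(\tfrac{1}{2}\right)^{2} = \tfrac{5}{16}$$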


and 


Note: The p.d.f. $f(x)$ of a random variable of the discrete type is bounded by one because
$f(x)$ gives a probability, namely $f(x) = P(X = x)$.


For random variables of the continuous type, the p.d.f. does not have to be bounded. The 
restriction is that the area between the p.d.f. and the x axis must equal one. Furthermore, it 
should be noted that the p.d.f. of a random variable X of the continuous type does not need 
to be a continuous function. 


For example,


$$f(x) = \begin{cases} \dfrac{1}{2}, & 0 < x < 1 \text{ or } 2 < x < 3, \\ 0, & \text{elsewhere}, \end{cases}$$


enjoys the properties of a p.d.f. of a distribution of the continuous type, and yet $f(x)$ has
discontinuities at x = 0, 1, 2, and 3. However, the distribution function associated with a
distribution of the continuous type is always a continuous function. For continuous-type
random variables, the definitions associated with mathematical expectation are the same as
those in the discrete case except that integrals replace summations.


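FOR ILLUSTRATION, let X be a random variable with a p.d.f. $f(x)$. The expected
value of X or mean of X is


$$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx.$$


The variance of X is


$$\sigma^{2} = \mathrm{Var}(X) = \int_{-\infty}^{\infty}(x - \mu)^{2} f(x)\,dx.$$


The standard deviation of X is


$$\sigma = \sqrt{\mathrm{Var}(X)}.$$


Example:
For the random variable Y in Example 3,


$$\mu = E(Y) = \int_{0}^{1} y(2y)\,dy = \frac{2}{3}$$


and


$$\sigma^{2} = E(Y^{2}) - \mu^{2} = \int_{0}^{1} y^{2}(2y)\,dy - \left(\frac{2}{3}\right)^{2} = \frac{1}{2} - \frac{4}{9} = \frac{1}{18}.$$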

THE UNIFORM AND EXPONENTIAL DISTRIBUTIONS 


The Uniform Distribution 


Let the random variable X denote the outcome when a point is selected at random from the
interval $[a, b]$, $-\infty < a < b < \infty$. If the experiment is performed in a fair manner, it is
reasonable to assume that the probability that the point is selected from the interval $[a, x]$,
$a \le x < b$, is $(x - a)/(b - a)$. That is, the probability is proportional to the length of the
interval so that the distribution function of X is


$$F(x) = \begin{cases} 0, & x < a, \\ \dfrac{x - a}{b - a}, & a \le x < b, \\ 1, & b \le x. \end{cases}$$


Because X is a continuous-type random variable, $F'(x)$ is equal to the p.d.f. of X whenever
$F'(x)$ exists; thus when $a < x < b$, we have $f(x) = F'(x) = 1/(b - a)$.


DEFINITION OF UNIFORM DISTRIBUTION
The random variable X has a uniform distribution if its p.d.f. is equal to a constant on its
support. In particular, if the support is the interval $[a, b]$, then
Equation:
$$f(x) = \frac{1}{b - a}, \quad a \le x \le b.$$


Moreover, one shall say that X is $U(a, b)$. This distribution is referred to as rectangular
because the graph of $f(x)$ suggests that name. See Figure 1 for the graph of $f(x)$ and the
distribution function $F(x)$.


Figure 1: The graph of the p.d.f. of the uniform distribution.


Note: We could have taken $f(a) = 0$ or $f(b) = 0$ without altering the probabilities, since this
is a continuous-type distribution, and this is done in some cases.


The mean and variance of X are as follows:


$$\mu = \frac{a + b}{2}$$


and


$$\sigma^{2} = \frac{(b - a)^{2}}{12}.$$

An important uniform distribution is that for which a = 0 and b = 1, namely $U(0, 1)$. If X is
$U(0, 1)$, approximate values of X can be simulated on most computers using a random number
generator. In fact, it should be called a pseudo-random number generator (see the module on
pseudo-random number generation) because the programs that produce the random numbers are
usually such that if the starting number is known, all subsequent numbers in the sequence may be
determined by simple arithmetical operations.
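To make the last remark concrete, here is a minimal sketch of one classical kind of pseudo-random number generator, a multiplicative congruential generator. The multiplier 16807, the modulus 2^31 - 1, and the seed 12345 are illustrative assumptions and are not taken from the text; the point is only that the same starting number reproduces the same sequence, and that the output behaves approximately like U(0, 1).

```python
from itertools import islice

def lcg(seed, a=16807, m=2**31 - 1):
    """Yield pseudo-random numbers in (0, 1).

    The multiplier a and modulus m are illustrative choices, not values
    from the text. The seed must be an integer between 1 and m - 1.
    """
    state = seed
    while True:
        state = (a * state) % m   # each state follows from the previous one by simple arithmetic
        yield state / m

def draw(seed, n):
    """Return the first n numbers generated from the given starting number."""
    return list(islice(lcg(seed), n))

u = draw(12345, 10_000)
same = draw(12345, 10_000)        # same starting number, hence the same sequence
mean = sum(u) / len(u)
var = sum((x - mean) ** 2 for x in u) / len(u)
print(u == same)
print(round(mean, 2), round(var, 3))
```

The sample mean and variance should land close to the U(0, 1) values 1/2 and 1/12, although the exact figures depend on the seed.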


An Exponential Distribution 


Let us turn to the continuous distribution that is related to the Poisson distribution. When
previously observing a process of the approximate Poisson type, we counted the number of
changes occurring in a given interval. This number was a discrete-type random variable with a
Poisson distribution. But not only is the number of changes a random variable; the waiting
times between successive changes are also random variables. However, the latter are of the
continuous type, since each of them can assume any positive value.


Let W denote the waiting time until the first change occurs when observing the Poisson process
in which the mean number of changes in the unit interval is $\lambda$. Then W is a continuous-type
random variable, and let us proceed to find its distribution function.


Because this waiting time is nonnegative, the distribution function $F(w) = 0$, $w < 0$. For
$w \ge 0$,


$$F(w) = P(W \le w) = 1 - P(W > w) = 1 - P(\text{no changes in } [0, w]) = 1 - e^{-\lambda w},$$


since it was previously discovered that $e^{-\lambda w}$ equals the probability of no changes in an
interval of length w; the mean number of changes in such an interval is proportional to w, namely
$\lambda w$. Thus when $w > 0$, the p.d.f. of W is given by


$$F'(w) = f(w) = \lambda e^{-\lambda w}.$$

DEFINITION OF EXPONENTIAL DISTRIBUTION


Let $\lambda = 1/\theta$; then the random variable X has an exponential distribution and its p.d.f. is
defined by
Equation:
$$f(x) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 \le x < \infty,$$


where the parameter $\theta > 0$.


Accordingly, the waiting time W until the first change in a Poisson process has an exponential
distribution with $\theta = 1/\lambda$. The mean and variance for the exponential distribution are as
follows: $\mu = \theta$ and $\sigma^{2} = \theta^{2}$.


So if $\lambda$ is the mean number of changes in the unit interval, then
$$\theta = \frac{1}{\lambda}$$
is the mean waiting time for the first change. Suppose that $\lambda = 7$ is the mean number of changes per
minute; then the mean waiting time for the first change is 1/7 of a minute.


Figure 2: The graph of the p.d.f. of the exponential distribution.


Example:
Let X have an exponential distribution with a mean of 40. The p.d.f. of X is


$$f(x) = \frac{1}{40}e^{-x/40}, \quad 0 \le x < \infty.$$
The probability that X is less than 36 is


$$P(X < 36) = \int_{0}^{36}\frac{1}{40}e^{-x/40}\,dx = 1 - e^{-36/40} = 0.593.$$


Example:
Let X have an exponential distribution with mean $\mu = \theta$. Then the distribution function of X is


$$F(x) = \begin{cases} 0, & -\infty < x < 0, \\ 1 - e^{-x/\theta}, & 0 \le x < \infty. \end{cases}$$


The p.d.f. and distribution function are graphed in Figure 3 for $\theta = 5$.


Figure 3: The p.d.f. and c.d.f. graphs of the exponential distribution with $\theta = 5$.


Note: For an exponential random variable X, we have that


$$P(X > x) = 1 - F(x) = 1 - \left(1 - e^{-x/\theta}\right) = e^{-x/\theta}.$$
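As a quick numerical companion to the exponential examples, the sketch below codes the distribution function and the tail probability for a given mean. The names `exp_cdf` and `exp_sf` are illustrative, not notation from the text.

```python
import math

def exp_cdf(x, theta):
    """Distribution function of an exponential variable with mean theta."""
    return 1.0 - math.exp(-x / theta) if x >= 0 else 0.0

def exp_sf(x, theta):
    """Tail probability P(X > x) = 1 - F(x)."""
    return 1.0 - exp_cdf(x, theta)

p_less_36 = exp_cdf(36, 40)   # mean 40: probability that X is less than 36
p_more_40 = exp_sf(40, 40)    # mean 40: probability that X exceeds 40, i.e. e^(-1)
print(round(p_less_36, 3), round(p_more_40, 3))
```

Both values can be compared with the worked examples for the distribution with mean 40.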


THE GAMMA AND CHI-SQUARE DISTRIBUTIONS 


GAMMA AND CHI-SQUARE DISTRIBUTIONS 


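In the (approximate) Poisson process with mean $\lambda$, we have seen that the waiting time until the first change
has an exponential distribution. Let W now denote the waiting time until the $\alpha$th change occurs, and let us find
the distribution of W. The distribution function of W, when $w \ge 0$, is given by


$$F(w) = P(W \le w) = 1 - P(W > w) = 1 - P(\text{fewer than } \alpha \text{ changes occur in } [0, w]) = 1 - \sum_{k=0}^{\alpha - 1}\frac{(\lambda w)^{k}e^{-\lambda w}}{k!},$$


since the number of changes in the interval [0, w] has a Poisson distribution with mean $\lambda w$. Because W is a
continuous-type random variable, $F'(w)$ is equal to the p.d.f. of W whenever this derivative exists. We
have, provided w > 0, that


$$F'(w) = \lambda e^{-\lambda w} - e^{-\lambda w}\sum_{k=1}^{\alpha - 1}\left[\frac{k(\lambda w)^{k-1}\lambda}{k!} - \frac{(\lambda w)^{k}\lambda}{k!}\right] = \lambda e^{-\lambda w} - e^{-\lambda w}\left[\lambda - \frac{\lambda(\lambda w)^{\alpha - 1}}{(\alpha - 1)!}\right] = \frac{\lambda(\lambda w)^{\alpha - 1}}{(\alpha - 1)!}e^{-\lambda w}.$$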


Gamma Distribution


If w < 0, then $F(w) = 0$ and $F'(w) = 0$. A p.d.f. of this form is said to be one of the gamma type,
and the random variable W is said to have the gamma distribution.
The gamma function is defined by


$$\Gamma(t) = \int_{0}^{\infty} y^{t-1}e^{-y}\,dy, \quad 0 < t.$$


This integral is positive for 0 < t, because the integrand is positive. Values of it are often given in a table
of integrals. If t > 1, integration of the gamma function of t by parts yields


$$\Gamma(t) = \left[-y^{t-1}e^{-y}\right]_{0}^{\infty} + \int_{0}^{\infty}(t - 1)y^{t-2}e^{-y}\,dy = (t - 1)\int_{0}^{\infty} y^{t-2}e^{-y}\,dy = (t - 1)\Gamma(t - 1).$$


Example:

For example, $\Gamma(6) = 5\Gamma(5)$ and $\Gamma(3) = 2\Gamma(2) = (2)(1)\Gamma(1)$. Whenever t = n, a positive integer, we have, by
repeated application of $\Gamma(t) = (t - 1)\Gamma(t - 1)$, that

$$\Gamma(n) = (n - 1)\Gamma(n - 1) = (n - 1)(n - 2)\cdots(2)(1)\Gamma(1).$$

However,


$$\Gamma(1) = \int_{0}^{\infty} e^{-y}\,dy = 1.$$


Thus when n is a positive integer, we have that $\Gamma(n) = (n - 1)!$; and, for this reason, the gamma function is
called the generalized factorial.


Incidentally, $\Gamma(1)$ corresponds to 0!, and we have noted that $\Gamma(1) = 1$, which is consistent with earlier
discussions.


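SUMMARIZING


The random variable X has a gamma distribution if its p.d.f. is defined by
Equation:


$$f(x) = \frac{1}{\Gamma(\alpha)\theta^{\alpha}}x^{\alpha - 1}e^{-x/\theta}, \quad 0 \le x < \infty.$$


Hence W, the waiting time until the $\alpha$th change in a Poisson process, has a gamma distribution with
parameters $\alpha$ and $\theta = 1/\lambda$.


The function $f(x)$ actually has the properties of a p.d.f., because $f(x) \ge 0$ and


$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{\infty}\frac{x^{\alpha - 1}e^{-x/\theta}}{\Gamma(\alpha)\theta^{\alpha}}\,dx,$$


which, by the change of variables $y = x/\theta$, equals


$$\int_{0}^{\infty}\frac{(\theta y)^{\alpha - 1}e^{-y}}{\Gamma(\alpha)\theta^{\alpha}}\,\theta\,dy = \frac{1}{\Gamma(\alpha)}\int_{0}^{\infty} y^{\alpha - 1}e^{-y}\,dy = \frac{\Gamma(\alpha)}{\Gamma(\alpha)} = 1.$$


The mean and variance are: $\mu = \alpha\theta$ and $\sigma^{2} = \alpha\theta^{2}$.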


Figure 1: The p.d.f. and c.d.f. graphs of the Gamma Distribution.


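Example:

Suppose that an average of 30 customers per hour arrive at a shop in accordance with a Poisson process.
That is, if a minute is our unit, then $\lambda = 1/2$. What is the probability that the shopkeeper will wait more
than 5 minutes before both of the first two customers arrive? If X denotes the waiting time in minutes until
the second customer arrives, then X has a gamma distribution with $\alpha = 2$, $\theta = 1/\lambda = 2$. Hence,


$$P(X > 5) = \int_{5}^{\infty}\frac{x^{2-1}e^{-x/2}}{\Gamma(2)2^{2}}\,dx = \int_{5}^{\infty}\frac{xe^{-x/2}}{4}\,dx = \frac{1}{4}\left[(-2)xe^{-x/2} - 4e^{-x/2}\right]_{5}^{\infty} = \frac{7}{2}e^{-5/2} = 0.287.$$


We could also have used the Poisson sum in the distribution function of W above, with $\lambda = 1/\theta$, because $\alpha$ is an integer:


$$P(X > 5) = \sum_{k=0}^{\alpha - 1}\frac{(5/2)^{k}e^{-5/2}}{k!} = e^{-5/2}\left(1 + \frac{5}{2}\right) = \frac{7}{2}e^{-5/2}.$$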

Chi-Square Distribution

Let us now consider the special case of the gamma distribution that plays an important role in statistics.
Let X have a gamma distribution with $\theta = 2$ and $\alpha = r/2$, where r is a positive integer. If the p.d.f. of
X is
Equation:


$$f(x) = \frac{1}{\Gamma(r/2)2^{r/2}}x^{r/2 - 1}e^{-x/2}, \quad 0 \le x < \infty,$$


we say that X has a chi-square distribution with r degrees of freedom, which we abbreviate by saying that X
is $\chi^{2}(r)$.


The mean and the variance of this chi-square distribution are
$$\mu = \alpha\theta = \left(\frac{r}{2}\right)2 = r$$
and
$$\sigma^{2} = \alpha\theta^{2} = \left(\frac{r}{2}\right)2^{2} = 2r.$$


That is, the mean equals the number of degrees of freedom and the variance equals twice the number of
degrees of freedom.


In Figure 2 the graphs of the chi-square p.d.f. for r = 2, 3, 5, and 8 are given.


Figure 2: The p.d.f. of the chi-square distribution for degrees of freedom r = 2, 3, 5, 8.


Note the relationship between the mean $\mu = r$ and the point at which the p.d.f. attains its maximum.


Because the chi-square distribution is so important in applications, tables have been prepared giving the
values of the distribution function for selected values of r and x,
Equation:


$$F(x) = \int_{0}^{x}\frac{1}{\Gamma(r/2)2^{r/2}}w^{r/2 - 1}e^{-w/2}\,dw.$$


Example:
Let X have a chi-square distribution with r = 5 degrees of freedom. Then, using tabulated values,


$$P(1.145 \le X \le 12.83) = F(12.83) - F(1.145) = 0.975 - 0.050 = 0.925$$
and


$$P(X > 15.09) = 1 - F(15.09) = 1 - 0.99 = 0.01.$$


Example:
If X is $\chi^{2}(7)$, two constants, a and b, such that $P(a < X < b) = 0.95$, are a = 1.690 and b = 16.01.
Other constants a and b can be found; the choices above are restricted only by the limited table.


Probabilities like that in Example 4 are so important in statistical applications that one uses special symbols
for a and b. Let $\alpha$ be a positive probability (that is usually less than 0.5) and let X have a chi-square
distribution with r degrees of freedom. Then $\chi^{2}_{\alpha}(r)$ is a number such that $P[X \ge \chi^{2}_{\alpha}(r)] = \alpha$.


That is, $\chi^{2}_{\alpha}(r)$ is the $100(1 - \alpha)$th percentile (or upper $100\alpha$ percent point) of the chi-square distribution with r
degrees of freedom. Then the $100\alpha$th percentile is the number $\chi^{2}_{1-\alpha}(r)$ such that $P[X \le \chi^{2}_{1-\alpha}(r)] = \alpha$.
That is, the probability to the right of $\chi^{2}_{1-\alpha}(r)$ is $1 - \alpha$. See Figure 3.


Example:
Let X have a chi-square distribution with seven degrees of freedom. Then, using tabulated values,
$\chi^{2}_{0.05}(7) = 14.07$ and $\chi^{2}_{0.95}(7) = 2.167$. These are the points that are indicated on Figure 3.


Figure 3: $\chi^{2}_{0.05}(7) = 14.07$ and $\chi^{2}_{0.95}(7) = 2.167$.


NORMAL DISTRIBUTION 




The normal distribution is perhaps the most important distribution in
statistical applications, since many measurements have (approximate)
normal distributions. One explanation of this fact is the role of the normal
distribution in the Central Limit Theorem.


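The random variable X has a normal distribution if its p.d.f. is defined
by
Equation:


$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right], \quad -\infty < x < \infty,$$


where $\mu$ and $\sigma^{2}$ are parameters satisfying
$-\infty < \mu < \infty$, $0 < \sigma < \infty$, and also where $\exp[v]$ means $e^{v}$.
Briefly, we say that X is $N(\mu, \sigma^{2})$.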

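Proof of the p.d.f. properties


Clearly, $f(x) > 0$. Let us now evaluate the integral


$$I = \int_{-\infty}^{\infty}\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right]dx,$$


showing that it is equal to 1. In the integral, change the variables of
integration by letting $z = (x - \mu)/\sigma$. Then,


$$I = \int_{-\infty}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-z^{2}/2}\,dz;$$


since $I > 0$, if $I^{2} = 1$, then $I = 1$.


Now


$$I^{2} = \frac{1}{2\pi}\left[\int_{-\infty}^{\infty}e^{-x^{2}/2}\,dx\right]\left[\int_{-\infty}^{\infty}e^{-y^{2}/2}\,dy\right],$$


or equivalently,


$$I^{2} = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\exp\left(-\frac{x^{2} + y^{2}}{2}\right)dx\,dy.$$


Letting $x = r\cos\theta$, $y = r\sin\theta$ (i.e., using polar coordinates), we have


$$I^{2} = \frac{1}{2\pi}\int_{0}^{2\pi}\int_{0}^{\infty}e^{-r^{2}/2}\,r\,dr\,d\theta = \frac{1}{2\pi}\int_{0}^{2\pi}d\theta = 1.$$


The mean and the variance of the normal distribution are as follows:


$$E(X) = \mu$$


and


$$\mathrm{Var}(X) = \sigma^{2}.$$


That is, the parameters $\mu$ and $\sigma^{2}$ in the p.d.f. are the mean and the variance
of X.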
Figure 1: p.d.f. and c.d.f. graphs of the Normal Distribution (panels: Probability Density Function,
Cumulative Distribution Function).


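Example:
If the p.d.f. of X is


$$f(x) = \frac{1}{\sqrt{32\pi}}\exp\left[-\frac{(x + 7)^{2}}{32}\right], \quad -\infty < x < \infty,$$


then X is $N(-7, 16)$.
That is, X has a normal distribution with a mean $\mu = -7$, variance $\sigma^{2} = 16$,
and the moment generating function


$$M(t) = \exp\left(-7t + 8t^{2}\right).$$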
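The p.d.f. properties can also be checked numerically for the distribution with mean -7 and variance 16. The sketch below is an assumption-laden illustration: it applies a midpoint rule over mu plus or minus 10 sigma (an arbitrary truncation, outside which the density is negligible), and the helper name `normal_pdf` is not from the text.

```python
import math

def normal_pdf(x, mu, sigma2):
    """p.d.f. of N(mu, sigma2), with sigma2 denoting the variance."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu, sigma2 = -7.0, 16.0
sigma = math.sqrt(sigma2)

# Midpoint rule on [mu - 10 sigma, mu + 10 sigma]; the truncation is an assumption.
lo, hi, steps = mu - 10 * sigma, mu + 10 * sigma, 100_000
h = (hi - lo) / steps
xs = [lo + (i + 0.5) * h for i in range(steps)]

area = h * sum(normal_pdf(x, mu, sigma2) for x in xs)                   # should be close to 1
mean = h * sum(x * normal_pdf(x, mu, sigma2) for x in xs)               # should be close to mu
var = h * sum((x - mean) ** 2 * normal_pdf(x, mu, sigma2) for x in xs)  # should be close to sigma2
print(round(area, 6), round(mean, 6), round(var, 4))
```

This does not replace the polar-coordinates argument; it only confirms that the parameters behave as the mean and the variance for one concrete choice.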


THE t DISTRIBUTION 




In probability and statistics, the t-distribution or Student's distribution arises in the problem of estimating the 
mean of a normally distributed population when the sample size is small, as well as when (as in nearly all practical 
statistical work) the population standard deviation is unknown and has to be estimated from the data. 

Textbook problems treating the standard deviation as if it were known are of two kinds: 


1. those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were 
certain, 

2. those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is 
temporarily ignored because that is not the point that the author or instructor is then explaining. 


THE t DISTRIBUTION 


t Distribution 
If Z is a random variable that is $N(0, 1)$, if U is a random variable that is $\chi^{2}(r)$, and if Z and U are
independent, then
Equation:


$$T = \frac{Z}{\sqrt{U/r}}$$


has a t distribution with r degrees of freedom.


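In terms of a sample of size N, the statistic can be written as


$$t = \frac{\bar{x} - \mu}{s/\sqrt{N}},$$


where $\mu$ is the population mean, $\bar{x}$ is the sample mean and s is the estimator for the population standard
deviation (i.e., the square root of the sample variance) defined by
Equation:


$$s^{2} = \frac{1}{N - 1}\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}.$$


If $\sigma = s$, then t = z and the distribution becomes the normal distribution. As N increases, Student's t distribution
approaches the normal distribution. It can be derived by transforming Student's z-distribution using


$$z = \frac{\bar{x} - \mu}{s}$$


and then defining


$$t = z\sqrt{n - 1}.$$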


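The resulting probability density and cumulative distribution functions are:
Equation:


$$f(t) = \frac{\Gamma\left((r + 1)/2\right)}{\sqrt{\pi r}\,\Gamma(r/2)\left(1 + t^{2}/r\right)^{(r + 1)/2}},$$


Equation:


$$F(t) = \frac{1}{2} + \frac{1}{2}\left[I\left(1; \frac{r}{2}, \frac{1}{2}\right) - I\left(\frac{r}{r + t^{2}}; \frac{r}{2}, \frac{1}{2}\right)\right]\mathrm{sgn}(t) = \frac{1}{2} - \frac{i\,t\,B\left(-\frac{t^{2}}{r}; \frac{1}{2}, \frac{1 - r}{2}\right)\Gamma\left(\frac{r + 1}{2}\right)}{2\sqrt{\pi}\,|t|\,\Gamma\left(\frac{r}{2}\right)},$$
where,
• $r = n - 1$ is the number of degrees of freedom,
• $-\infty < t < \infty$,
• $\Gamma(z)$ is the gamma function,
• $B(a, b)$ is the beta function and $B(z; a, b)$ is the incomplete beta function,
• $I(z; a, b)$ is the regularized beta function defined by
$$I(z; a, b) = \frac{B(z; a, b)}{B(a, b)}.$$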


The effect of the degrees of freedom on the t distribution is illustrated in the four t distributions in Figure 1.


Figure 1: p.d.f. of the t distribution for degrees of freedom r = 3, r = 6, r = ∞ (points marked on the
t axis: ±1.96, ±2.45, ±3.18).


In general, it is difficult to evaluate the distribution function of T. Some values are usually given in tables. Also
observe that the graph of the p.d.f. of T is symmetrical with respect to the vertical axis t = 0 and is very similar to
the graph of the p.d.f. of the standard normal distribution $N(0, 1)$. However, the tails of the t distribution are
heavier than those of a normal one; that is, there is more extreme probability in the t distribution than in the
standardized normal one. Because of the symmetry of the t distribution about t = 0, the mean (if it exists) must be
equal to zero. That is, it can be shown that $E(T) = 0$ when $r \ge 2$. When r = 1 the t distribution is the Cauchy
distribution, and thus both the variance and mean do not exist.
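The heavier tails and the normal limit can be seen by evaluating the two densities directly. This is a minimal sketch assuming the standard form of the t p.d.f. with r degrees of freedom; it uses the logarithm of the gamma function so that large r does not overflow, and the function names are illustrative.

```python
import math

def t_pdf(t, r):
    """p.d.f. of the t distribution with r degrees of freedom."""
    # Ratio Gamma((r+1)/2) / Gamma(r/2), computed on the log scale for stability.
    log_ratio = math.lgamma((r + 1) / 2) - math.lgamma(r / 2)
    return math.exp(log_ratio) / (math.sqrt(math.pi * r) * (1 + t * t / r) ** ((r + 1) / 2))

def std_normal_pdf(z):
    """p.d.f. of N(0, 1)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

for r in (1, 3, 6, 1000):
    # Compare the density in the tail (t = 3) with the standard normal value.
    print(r, t_pdf(3.0, r), std_normal_pdf(3.0))
```

For r = 1 the function reduces to the Cauchy density, and as r grows the printed values approach the standard normal density at the same point.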


Estimation 


ESTIMATION 


Once a model is specified with its parameters and data have been collected,
one is in a position to evaluate the model's goodness of fit, that is, how well
the model fits the observed pattern of data. Goodness of fit is assessed by
finding the parameter values of the model that best fit the data, a procedure
called parameter estimation.


There are two generally accepted methods of parameter estimation. They
are least squares estimation (LSE) and maximum likelihood estimation
(MLE). The former is well known, since linear regression, the sum of squares
error, and the root mean squared deviation are all tied to the method. On the
other hand, MLE is not widely recognized among modelers in psychology,
though it is, by far, the most commonly used method of parameter
estimation in the statistics community. LSE might be useful for obtaining a
descriptive measure for the purpose of summarizing observed data, but
MLE is more suitable for statistical inference such as model comparison.
LSE has no basis for constructing confidence intervals or testing hypotheses,
whereas both are naturally built into MLE.


Properties of Estimators 
UNBIASED AND BIASED ESTIMATORS 


Let us consider random variables for which the functional form of the p.d.f. is
known, but the distribution depends on an unknown parameter $\theta$ that may
have any value in a set $\Omega$, which is called the parameter space. In
estimation, a random sample from the distribution is taken to elicit some
information about the unknown parameter $\theta$. The experiment is repeated n
independent times, the sample $X_{1}, X_{2}, \ldots, X_{n}$ is observed, and one tries to
guess the value of $\theta$ using the observations $x_{1}, x_{2}, \ldots, x_{n}$.


The function of $X_{1}, X_{2}, \ldots, X_{n}$ used to guess $\theta$ is called an estimator of $\theta$.
We want it to be such that the computed estimate $u(x_{1}, x_{2}, \ldots, x_{n})$ is usually
close to $\theta$. Let $Y = u(X_{1}, X_{2}, \ldots, X_{n})$ be an estimator of $\theta$. If Y is to be a good
estimator of $\theta$, a very desirable property is that its mean be equal to $\theta$,
namely $E(Y) = \theta$.


If $E[u(X_{1}, X_{2}, \ldots, X_{n})] = \theta$, then $u(X_{1}, X_{2}, \ldots, X_{n})$ is called an unbiased estimator of $\theta$.
Otherwise, it is said to be biased.


It is required not only that an estimator has expectation equal to $\theta$, but also
that the variance of the estimator be as small as possible. If there are two
unbiased estimators of $\theta$, one would probably choose the one
with the smaller variance. In general, with a random sample $X_{1}, X_{2}, \ldots, X_{n}$
of a fixed sample size n, a statistician might like to find the estimator
$Y = u(X_{1}, X_{2}, \ldots, X_{n})$ of an unknown parameter $\theta$ which minimizes the
mean (expected) value of the square of the error (difference) $Y - \theta$, that is,
minimizes


$$E[(Y - \theta)^{2}] = E\{[u(X_{1}, X_{2}, \ldots, X_{n}) - \theta]^{2}\}.$$


The statistic Y that minimizes $E[(Y - \theta)^{2}]$ is the one with minimum
mean square error. If we restrict our attention to unbiased estimators only,
then


$$\mathrm{Var}(Y) = E[(Y - \theta)^{2}],$$


and the unbiased statistic Y that minimizes this expression is said to be the
unbiased minimum variance estimator of $\theta$.


Method of Moments 


One of the oldest procedures for estimating parameters is the method of
moments. Another method for finding an estimator of an unknown
parameter is called the method of maximum likelihood. In general, in the
method of moments, if there are k parameters that have to be estimated, the
first k sample moments are set equal to the first k population moments that
are given in terms of the unknown parameters.


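Example:

Let the distribution of X be $N(\mu, \sigma^{2})$. Then $E(X) = \mu$ and
$E(X^{2}) = \sigma^{2} + \mu^{2}$. Given a random sample of size n, the first two
moments are given by


$$m_{1} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$$


and


$$m_{2} = \frac{1}{n}\sum_{i=1}^{n}x_{i}^{2}.$$


Setting $m_{1} = E(X)$ and $m_{2} = E(X^{2})$ gives


$$\frac{1}{n}\sum_{i=1}^{n}x_{i} = \mu$$


and


$$\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} = \sigma^{2} + \mu^{2}.$$


The first equation yields $\bar{x}$ as the estimate of $\mu$. Replacing $\mu^{2}$ with $\bar{x}^{2}$ in
the second equation and solving for $\sigma^{2}$,
we obtain


$$\frac{1}{n}\sum_{i=1}^{n}x_{i}^{2} - \bar{x}^{2} = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \bar{x})^{2} = v$$


for the solution of $\sigma^{2}$.


Thus the method of moments estimators for $\mu$ and $\sigma^{2}$ are $\tilde{\mu} = \bar{X}$ and
$\tilde{\sigma}^{2} = V$. Of course, $\tilde{\mu} = \bar{X}$ is unbiased whereas $\tilde{\sigma}^{2} = V$ is biased.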
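The two moment equations of this example translate directly into code. The following is a minimal sketch on simulated data; the true values 10 and 3, the sample size 5000, and the seed are arbitrary assumptions chosen only for illustration.

```python
import random

def method_of_moments_normal(xs):
    """Return the method of moments estimates (mean, variance) for a normal sample."""
    n = len(xs)
    m1 = sum(xs) / n                    # first sample moment
    m2 = sum(x * x for x in xs) / n     # second sample moment
    return m1, m2 - m1 ** 2             # solve m1 = mu and m2 = sigma^2 + mu^2

rng = random.Random(0)                               # fixed seed so the run is reproducible
data = [rng.gauss(10.0, 3.0) for _ in range(5000)]   # hypothetical N(10, 9) sample

mu_hat, v = method_of_moments_normal(data)
# The same variance estimate written as an average squared deviation (divisor n, not n - 1).
v_check = sum((x - mu_hat) ** 2 for x in data) / len(data)
print(round(mu_hat, 2), round(v, 2), round(v_check, 2))
```

The divisor n in the variance estimate is what makes it biased; multiplying by n/(n - 1) would give the usual unbiased sample variance.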


At this stage the question arises of which of two different estimators, $\hat{\theta}$ and $\tilde{\theta}$,
for a parameter $\theta$ one should use. Most statisticians select the one that has the
smallest mean square error; for example, if


$$E[(\hat{\theta} - \theta)^{2}] < E[(\tilde{\theta} - \theta)^{2}],$$


then $\hat{\theta}$ seems to be preferred. This means that if $E(\hat{\theta}) = E(\tilde{\theta}) = \theta$, then
one would select the one with the smallest variance.


Next, other questions should be considered. Namely, given an estimate for a
parameter, how accurate is the estimate? How confident is one about the
closeness of the estimate to the unknown parameter?


Note: CONFIDENCE INTERVALS I and CONFIDENCE INTERVALS II 


CONFIDENCE INTERVALS I 




Given a random sample $X_{1}, X_{2}, \ldots, X_{n}$ from a normal distribution
$N(\mu, \sigma^{2})$, consider the closeness of $\bar{X}$, the unbiased estimator of $\mu$, to
the unknown $\mu$. To do this, the error structure (distribution) of $\bar{X}$,
namely that $\bar{X}$ is $N(\mu, \sigma^{2}/n)$, is used in order to construct what is
called a confidence interval for the unknown parameter $\mu$, when the
variance $\sigma^{2}$ is known.


For the probability $1 - \alpha$, it is possible to find a number $z_{\alpha/2}$ such that


$$P\left(-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha.$$


For example, if $1 - \alpha = 0.95$, then $z_{\alpha/2} = z_{0.025} = 1.96$, and if
$1 - \alpha = 0.90$, then $z_{\alpha/2} = z_{0.05} = 1.645$.


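Recalling that $\sigma > 0$, the following inequalities are equivalent:


$$-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2},$$

$$-z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le \bar{X} - \mu \le z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),$$

$$-\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le -\mu \le -\bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),$$


and


$$\bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \ge \mu \ge \bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right).$$
Thus, since the probability of the first of these is $1 - \alpha$, the probability of
the last must also be $1 - \alpha$, because the latter is true if and only if the
former is true. That is,


$$P\left[\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) \le \mu \le \bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right] = 1 - \alpha.$$


So the probability that the random interval


$$\left[\bar{X} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),\ \bar{X} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right]$$


includes the unknown mean $\mu$ is $1 - \alpha$.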


Once the sample is observed and the sample mean computed equal to
$\bar{x}$, the interval


$$\left[\bar{x} - z_{\alpha/2}(\sigma/\sqrt{n}),\ \bar{x} + z_{\alpha/2}(\sigma/\sqrt{n})\right]$$


is a known interval. Since the probability that the random interval
covers $\mu$ before the sample is drawn is equal to $1 - \alpha$, call the
computed interval, $\bar{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$ (for brevity), a $100(1 - \alpha)\%$
confidence interval for the unknown mean $\mu$.

The number $100(1 - \alpha)\%$, or equivalently, $1 - \alpha$, is called the
confidence coefficient.


For illustration,
$$\bar{x} \pm 1.96(\sigma/\sqrt{n})$$


is a 95% confidence interval for $\mu$.


It can be seen that the confidence interval for $\mu$ is centered at the point
estimate $\bar{x}$ and is completed by subtracting and adding the quantity


$$z_{\alpha/2}(\sigma/\sqrt{n}).$$


Note: as n increases, $z_{\alpha/2}(\sigma/\sqrt{n})$ decreases, resulting in a shorter
confidence interval with the same confidence coefficient $1 - \alpha$.


A shorter confidence interval indicates that there is more reliance in $\bar{x}$ as an
estimate of $\mu$. For a fixed sample size n, the length of the confidence
interval can also be shortened by decreasing the confidence coefficient
$1 - \alpha$. But if this is done, the shorter interval is achieved by losing some
confidence.


Example:

Let $\bar{x}$ be the observed sample mean of 16 items of a random sample from
the normal distribution $N(\mu, \sigma^{2})$ with known variance $\sigma^{2} = 23.04$. A 90% confidence
interval for the unknown mean $\mu$ is


$$\left[\bar{x} - 1.645\sqrt{\frac{23.04}{16}},\ \bar{x} + 1.645\sqrt{\frac{23.04}{16}}\right].$$


For a particular sample this interval either does or does not contain the
mean $\mu$. However, if many such intervals were calculated, it should be true
that about 90% of them contain the mean $\mu$.
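The statement that about 90% of such intervals contain the mean can be illustrated by simulation. The sketch below assumes a hypothetical true mean of 50 (any value would do), the known variance 23.04 and n = 16 from the example, and a fixed seed; the function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def z_interval(xbar, sigma, n, conf):
    """Confidence interval for the mean when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # z_{alpha/2}, about 1.645 for 90%
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

def coverage(mu, sigma, n, conf, trials, seed):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        lo, hi = z_interval(xbar, sigma, n, conf)
        hits += lo <= mu <= hi
    return hits / trials

sigma = math.sqrt(23.04)
lo, hi = z_interval(0.0, sigma, 16, 0.90)   # interval endpoints relative to the sample mean
rate = coverage(mu=50.0, sigma=sigma, n=16, conf=0.90, trials=5000, seed=1)
print(round(hi, 3), rate)
```

The first printed number is the half-width of the interval in the example; the second is the observed coverage rate, which should be near 0.90 but will vary slightly with the seed and the number of trials.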


If one cannot assume that the distribution from which the sample arose is
normal, one can still obtain an approximate confidence interval for $\mu$. By
the Central Limit Theorem the ratio $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ has, provided that
n is large enough, the approximate normal distribution $N(0, 1)$ when the
underlying distribution is not normal. In this case


$$P\left[-z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right] \approx 1 - \alpha,$$


and


$$\left[\bar{x} - z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right),\ \bar{x} + z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right)\right]$$


is an approximate $100(1 - \alpha)\%$ confidence interval for $\mu$. The closeness
of the approximate probability $1 - \alpha$ to the exact probability depends on
both the underlying distribution and the sample size. When the underlying
distribution is unimodal (has only one mode) and continuous, the
approximation is usually quite good for even small n, such as n = 5. As the
underlying distribution becomes less normal (i.e., badly skewed or
discrete), a larger sample size might be required to keep a reasonably accurate
approximation. But, in all cases, an n of at least 30 is usually quite
adequate.


Note: Confidence Intervals II 


CONFIDENCE INTERVALS II 


Confidence Intervals for Means 


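In the preceding considerations (Confidence Intervals I), the confidence interval for the mean
$\mu$ of a normal distribution was found, assuming that the value of the standard deviation $\sigma$ is
known. However, in most applications, the value of the standard deviation $\sigma$ is
unknown, although in some cases one might have a very good idea about its value.


Suppose that the underlying distribution is normal and that $\sigma^{2}$ is unknown. It is shown that,
given a random sample $X_{1}, X_{2}, \ldots, X_{n}$ from a normal distribution, the statistic


$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$


has a t distribution with $r = n - 1$ degrees of freedom, where $S^{2}$ is the usual unbiased
estimator of $\sigma^{2}$ (see t distribution).


Select $t_{\alpha/2}(n - 1)$ so that


$$P[T \ge t_{\alpha/2}(n - 1)] = \alpha/2.$$


Then


$$1 - \alpha = P\left[-t_{\alpha/2}(n - 1) \le \frac{\bar{X} - \mu}{S/\sqrt{n}} \le t_{\alpha/2}(n - 1)\right] = P\left[\bar{X} - t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}} \le \mu \le \bar{X} + t_{\alpha/2}(n - 1)\frac{S}{\sqrt{n}}\right].$$


Thus the observations of a random sample provide $\bar{x}$ and $s^{2}$, and
$$\left[\bar{x} - t_{\alpha/2}(n - 1)\frac{s}{\sqrt{n}},\ \bar{x} + t_{\alpha/2}(n - 1)\frac{s}{\sqrt{n}}\right]$$
is a $100(1 - \alpha)\%$ confidence interval for $\mu$.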


Example:

Let X equal the amount of butterfat in pounds produced by a typical cow during a 305-day
milk production period between her first and second calves. Assume the distribution of X is
$N(\mu, \sigma^{2})$. To estimate $\mu$, a farmer measures the butterfat production for n = 20 cows, yielding
the following data:


481 537 513 583 453 510 570
500 457 555 618 327 350 643
499 421 505 637 599 392


For these data, $\bar{x} = 507.50$ and $s = 89.75$. Thus a point estimate of $\mu$ is $\bar{x} = 507.50$. Since
$t_{0.05}(19) = 1.729$, a 90% confidence interval for $\mu$ is $507.50 \pm 1.729\left(89.75/\sqrt{20}\right)$, or
equivalently, [472.80, 542.20].
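The interval arithmetic for this example is short enough to script. The sketch below works from summary statistics rather than the raw data, and it assumes a sample mean of 507.50, a sample standard deviation of 89.75, and the table value 1.729 for the t distribution with 19 degrees of freedom; these are the values consistent with the reported interval [472.80, 542.20]. The function name is illustrative.

```python
import math

def t_interval(xbar, s, n, t_crit):
    """Confidence interval for the mean: xbar plus or minus t_crit * s / sqrt(n)."""
    half = t_crit * s / math.sqrt(n)
    return xbar - half, xbar + half

# Assumed summary statistics for the n = 20 cows and a t table value for 19 degrees of freedom.
lo, hi = t_interval(xbar=507.50, s=89.75, n=20, t_crit=1.729)
print(round(lo, 2), round(hi, 2))
```

The critical value is passed in explicitly because the Python standard library does not provide t quantiles; in practice it would be read from a table or computed with a statistics package.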


Let T have a t distribution with n - 1 degrees of freedom. Then $t_{\alpha/2}(n - 1) > z_{\alpha/2}$.
Consequently, the interval $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ is expected to be shorter than the interval
$\bar{x} \pm t_{\alpha/2}(n - 1)\,s/\sqrt{n}$. After all, there is more information, namely the value of $\sigma$, in
constructing the first interval. However, the length of the second interval is very much
dependent on the value of s. If the observed s is smaller than $\sigma$, a shorter confidence interval
could result by the second scheme. But on the average, $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$ is the shorter of the
two confidence intervals.


If it is not possible to assume that the underlying distribution is normal, but $\mu$ and $\sigma$ are both
unknown, approximate confidence intervals for $\mu$ can still be constructed using


$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}},$$


which now only has an approximate t distribution.


Generally, this approximation is quite good for many nonnormal distributions, in particular, if the
underlying distribution is symmetric, unimodal, and of the continuous type. However, if the
distribution is highly skewed, there is great danger in using this approximation. In such a
situation, it would be safer to use a certain nonparametric method for finding a confidence
interval for the median of the distribution.


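Confidence Interval for Variances


The confidence interval for the variance $\sigma^{2}$ is based on the sample variance


$$S^{2} = \frac{1}{n - 1}\sum_{i=1}^{n}(X_{i} - \bar{X})^{2}.$$


In order to find a confidence interval for $\sigma^{2}$, it is used that the distribution of
$(n - 1)S^{2}/\sigma^{2}$ is $\chi^{2}(n - 1)$. The constants a and b should be selected from the tabulated
chi-square distribution with n - 1 degrees of freedom such that


$$P\left(a \le \frac{(n - 1)S^{2}}{\sigma^{2}} \le b\right) = 1 - \alpha.$$


That is, select a and b so that the probabilities in the two tails are equal:


$$a = \chi^{2}_{1 - \alpha/2}(n - 1)$$


and


$$b = \chi^{2}_{\alpha/2}(n - 1).$$


Then, solving the inequalities, we have


$$1 - \alpha = P\left(\frac{a}{(n - 1)S^{2}} \le \frac{1}{\sigma^{2}} \le \frac{b}{(n - 1)S^{2}}\right) = P\left(\frac{(n - 1)S^{2}}{b} \le \sigma^{2} \le \frac{(n - 1)S^{2}}{a}\right).$$


Thus the probability that the random interval


$$\left[\frac{(n - 1)S^{2}}{b},\ \frac{(n - 1)S^{2}}{a}\right]$$


contains the unknown $\sigma^{2}$ is $1 - \alpha$. Once the values of $X_{1}, X_{2}, \ldots, X_{n}$ are observed to be
$x_{1}, x_{2}, \ldots, x_{n}$ and $s^{2}$ is computed, then the interval


$$\left[\frac{(n - 1)s^{2}}{b},\ \frac{(n - 1)s^{2}}{a}\right]$$


is a $100(1 - \alpha)\%$ confidence interval for $\sigma^{2}$.


It follows that


$$\left[\sqrt{\frac{n - 1}{b}}\,s,\ \sqrt{\frac{n - 1}{a}}\,s\right]$$


is a $100(1 - \alpha)\%$ confidence interval for $\sigma$, the standard deviation.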


Example: 
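Assume that the time in days required for maturation of seeds of a species of a flowering plant found in Mexico is $N(\mu, \sigma^2)$. A random sample of $n = 13$ seeds, both parents having narrow leaves, yielded $\bar{x} = 18.97$ days and an observed value of $12s^2 = \sum_{i=1}^{13}(x_i - \bar{x})^2$.

A 90% confidence interval for $\sigma^2$ is $\left[12s^2/21.03,\ 12s^2/5.226\right]$, because $5.226 = \chi^2_{0.95}(12)$ and $21.03 = \chi^2_{0.05}(12)$, which can be read from the tabulated chi-square distribution.

The corresponding 90% confidence interval for $\sigma$ is $\left[s\sqrt{12/21.03},\ s\sqrt{12/5.226}\right]$.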


Although $a$ and $b$ are generally selected so that the probabilities in the two tails are equal, the resulting $100(1-\alpha)\%$ confidence interval is not the shortest that can be formed using the available data. Tables and appendixes give solutions for $a$ and $b$ that yield the confidence interval of minimum length for the standard deviation.
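
As a rough sketch of the equal-tails construction, the interval endpoints can be computed once the chi-square quantiles are read from a table. The quantiles 5.226 and 21.03 are the table values for 12 degrees of freedom; the sum of squared deviations used here is a made-up value for illustration only, not the value from the seed example.

```python
import math

def variance_interval(sum_sq_dev, a, b):
    """CI for sigma^2: [sum_sq_dev / b, sum_sq_dev / a], where
    sum_sq_dev = (n - 1) * s^2 and a < b are chi-square quantiles."""
    return sum_sq_dev / b, sum_sq_dev / a

# Chi-square table values for 12 degrees of freedom (n = 13), 90% confidence.
a, b = 5.226, 21.03
# HYPOTHETICAL sum of squared deviations, for illustration only.
sum_sq_dev = 120.0

var_lo, var_hi = variance_interval(sum_sq_dev, a, b)
sd_lo, sd_hi = math.sqrt(var_lo), math.sqrt(var_hi)
print(f"variance: [{var_lo:.2f}, {var_hi:.2f}]")
print(f"std dev:  [{sd_lo:.2f}, {sd_hi:.2f}]")
```

Taking square roots of the variance endpoints gives the interval for the standard deviation, as in the formula above.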


SAMPLE SIZE 




A question asked very frequently in statistical consulting is: how large should the sample size be to estimate a mean?

The answer depends on the variation associated with the random variable under observation. The statistician could correctly respond that only one item is needed, provided that the standard deviation of the distribution is zero. That is, if $\sigma$ equals zero, then the value of that one item would necessarily equal the unknown mean of the distribution. This is the extreme case and one that is not met in practice. However, the smaller the variance, the smaller the sample size needed to achieve a given degree of accuracy.


Example: 

A mathematics department wishes to evaluate a new method of teaching calculus that does mathematics using a computer. At the end of the course, the evaluation will be made on the basis of scores of the participating students on a standard test. Because there is an interest in estimating the mean score $\mu$ for students taking calculus using a computer, there is a desire to determine the number of students, $n$, who are to be selected at random from a larger group. So, let us find the sample size $n$ such that we are fairly confident that $\bar{x} \pm 1$ contains the unknown test mean $\mu$. From past experience it is believed that the standard deviation associated with this type of test is 15. Accordingly, using the fact that the sample mean of the test scores, $\bar{X}$, is approximately $N(\mu, \sigma^2/n)$, it is seen that the interval given by $\bar{x} \pm 1.96\,(15/\sqrt{n})$ will serve as an approximate 95% confidence interval for $\mu$.


That is, $1.96\,(15/\sqrt{n}) = 1$, or equivalently $\sqrt{n} = 29.4$ and thus $n \approx 864.36$, or $n = 865$ because $n$ must be an integer. It is quite likely that it had not been anticipated that as many as 865 students would be needed in this study. If that is the case, the statistician must discuss with those involved in the experiment whether or not the accuracy and the confidence level could be relaxed somewhat. For illustration, rather than requiring $\bar{x} \pm 1$ to be a 95% confidence interval for $\mu$, possibly $\bar{x} \pm 2$ would be satisfactory as an 80% one. If this modification is acceptable, we now have $1.282\,(15/\sqrt{n}) = 2$, or equivalently $\sqrt{n} = 9.615$ and thus $n \approx 92.4$. Since $n$ must be an integer, $n = 93$ is used in practice.


Most likely, the persons involved in this project would find this a more reasonable sample size. Of course, any sample size greater than 93 could be used. Then either the length of the confidence interval could be decreased from that of $\bar{x} \pm 2$, or the confidence coefficient could be increased from 80%, or a combination of both. Also, since there might be some question of whether the standard deviation $\sigma$ actually equals 15, the sample standard deviation $s$ would no doubt be used in the construction of the interval.


For example, suppose that the sample size $n$, the sample mean $\bar{x}$, and the sample standard deviation $s$ have been observed; then $\bar{x} \pm 1.282\,s/\sqrt{n}$ provides an approximate 80% confidence interval for $\mu$.
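In general, if we want the $100(1-\alpha)\%$ confidence interval for $\mu$, $\bar{x} \pm z_{\alpha/2}\,(\sigma/\sqrt{n})$, to be no longer than that given by $\bar{x} \pm \varepsilon$, the sample size $n$ is the solution of

$$\varepsilon = \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}, \quad\text{where } \Phi(z_{\alpha/2}) = 1 - \frac{\alpha}{2}.$$

That is,

$$n = \frac{z_{\alpha/2}^2\,\sigma^2}{\varepsilon^2},$$

where it is assumed that $\sigma^2$ is known.

Sometimes

$$\varepsilon = \frac{z_{\alpha/2}\,\sigma}{\sqrt{n}}$$

is called the maximum error of the estimate. If the experimenter has no idea about the value of $\sigma^2$, it may be necessary to first take a preliminary sample to estimate $\sigma^2$.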
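
The sample-size rule for a mean can be sketched in a few lines. This is only an illustration: the function name is hypothetical, the normal quantile comes from the standard library's `NormalDist`, and the result is rounded up because $n$ must be an integer. The two calls reproduce the calculus-course example ($\sigma = 15$).

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, epsilon, confidence):
    """Smallest integer n with z_{alpha/2} * sigma / sqrt(n) <= epsilon."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((z * sigma / epsilon) ** 2)

n_95 = sample_size_for_mean(sigma=15, epsilon=1, confidence=0.95)
n_80 = sample_size_for_mean(sigma=15, epsilon=2, confidence=0.80)
print(n_95, n_80)
```

Relaxing both the maximum error and the confidence level shrinks the required sample from 865 to 93, in line with the example.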


The type of statistic we see most often in newspapers and magazines is an estimate of a proportion $p$. We might, for example, want to know the percentage of the labor force that is unemployed or the percentage of voters favoring a certain candidate. Sometimes extremely important decisions are made on the basis of these estimates. If this is the case, we would most certainly desire short confidence intervals for $p$ with large confidence coefficients. We recognize that these conditions will require a large sample size. On the other hand, if the fraction $p$ being estimated is not too important, an estimate associated with a longer confidence interval and a smaller confidence coefficient is satisfactory, and thus a smaller sample size can be used.


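In general, to find the required sample size to estimate $p$, recall that the point estimate of $p$ is $\hat{p} = y/n$.

Suppose we want an estimate of $p$ that is within $\varepsilon$ of the unknown $p$ with $100(1-\alpha)\%$ confidence, where $\varepsilon = z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ is the maximum error of the point estimate $\hat{p} = y/n$. Since $\hat{p}$ is unknown before the experiment is run, we cannot use the value of $\hat{p}$ in our determination of $n$. However, if it is known that $p$ is about equal to $p^*$, the necessary sample size $n$ is the solution of

$$\varepsilon = \frac{z_{\alpha/2}\sqrt{p^*(1-p^*)}}{\sqrt{n}}.$$

That is,

$$n = \frac{z_{\alpha/2}^2\,p^*(1-p^*)}{\varepsilon^2}.$$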
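
As a sketch of this calculation, solving the maximum-error equation for $n$ gives $n = z_{\alpha/2}^2\,p^*(1-p^*)/\varepsilon^2$, which can be coded directly. The planning values below ($p^* = 0.5$, $\varepsilon = 0.03$, 95% confidence) are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(p_star, epsilon, confidence):
    """Smallest integer n with z_{alpha/2} * sqrt(p*(1 - p*) / n) <= epsilon."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(z**2 * p_star * (1.0 - p_star) / epsilon**2)

# HYPOTHETICAL planning values: p* = 0.5, epsilon = 0.03, 95% confidence.
n_needed = sample_size_for_proportion(p_star=0.5, epsilon=0.03, confidence=0.95)
print(n_needed)
```

Because $p^*(1-p^*)$ is largest at $p^* = 0.5$, that choice gives the most conservative (largest) sample size.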


MAXIMUM LIKELIHOOD ESTIMATION (MLE) 


Likelihood function 


From a statistical standpoint, the data vector $x = (x_1, x_2, \ldots, x_n)$, the outcome of an experiment, is a random sample from an unknown population. The goal of data analysis is to identify the population that is most likely to have generated the sample. In statistics, each population is identified by a corresponding probability distribution. Associated with each probability distribution is a unique value of the model's parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model's parameters.

Let $f(x|\theta)$ denote the probability density function (PDF) that specifies the probability of observing data $x$ given the parameter $\theta$. The parameter $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ is a vector defined on a multi-dimensional parameter space. If the individual observations $x_i$ are statistically independent of one another, then, according to the theory of probability, the PDF for the data $x = (x_1, x_2, \ldots, x_n)$ can be expressed as a product of the PDFs for the individual observations,

$$f(x|\theta) = f(x_1|\theta)\,f(x_2|\theta) \cdots f(x_n|\theta),$$

$$L(\theta) = \prod_{i=1}^{n} f(x_i|\theta).$$


To illustrate the idea of a PDF, consider the simplest case with one observation and one parameter, that is, $n = k = 1$. Suppose that the data $x$ represents the number of successes in a sequence of 10 independent binary trials (e.g., a coin tossing experiment) and that the probability of a success on any one trial, represented by the parameter $\theta$, is 0.2. The PDF in this case is then given by

$$f(x|\theta = 0.2) = \frac{10!}{x!\,(10-x)!}(0.2)^x(0.8)^{10-x}, \quad x = 0, 1, \ldots, 10,$$

which is known as the binomial probability distribution. The shape of this PDF is shown in the top panel of Figure 1. If the parameter value is changed to, say, $\theta = 0.7$, a new PDF is obtained as

$$f(x|\theta = 0.7) = \frac{10!}{x!\,(10-x)!}(0.7)^x(0.3)^{10-x}, \quad x = 0, 1, \ldots, 10,$$

whose shape is shown in the bottom panel of Figure 1. The following is the general expression of the binomial PDF for arbitrary values of $\theta$ and $n$:

$$f(x|\theta) = \frac{n!}{x!\,(n-x)!}\,\theta^x(1-\theta)^{n-x}, \quad 0 \le \theta \le 1,\ x = 0, 1, \ldots, n,$$

which, as a function of $x$, specifies the probability of data $x$ for a given value of the parameter $\theta$. The collection of all such PDFs generated by varying the parameter across its range (0 to 1 in this case) defines a model.
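
The two binomial PDFs can be tabulated with a minimal sketch. The function name `binomial_pdf` is illustrative; the code checks that each PDF sums to one and reports the most probable outcome for each parameter value.

```python
import math

def binomial_pdf(x, n, theta):
    """f(x | theta) = n! / (x! (n - x)!) * theta^x * (1 - theta)^(n - x)."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

n = 10
for theta in (0.2, 0.7):
    pdf = [binomial_pdf(x, n, theta) for x in range(n + 1)]
    mode = max(range(n + 1), key=lambda x: pdf[x])
    print(f"theta = {theta}: total = {sum(pdf):.6f}, most probable x = {mode}")
```

Changing $\theta$ from 0.2 to 0.7 moves the bulk of the probability from small to large counts, which is the sense in which the parameter indexes different distributions.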


Figure 1: Binomial probability distributions of sample size $n = 10$ and probability parameter $\theta = 0.2$ (top panel) and $\theta = 0.7$ (bottom panel), plotted against the data.


Maximum Likelihood Estimation 


Once data have been collected and the likelihood function of a model given the data is determined, one is in a position to make statistical inferences about the population, that is, the probability distribution that underlies the data. Given that different parameter values index different probability distributions (Figure 1), we are interested in finding the parameter value that corresponds to the desired PDF.

The principle of maximum likelihood estimation (MLE), originally developed by R. A. Fisher in the 1920s, states that the desired probability distribution is the one that makes the observed data most likely. It is obtained by seeking the value of the parameter vector that maximizes the likelihood function $L(\theta)$. The resulting parameter vector, which is sought by searching the multidimensional parameter space, is called the MLE estimate, denoted by

$$\theta_{MLE} = (\theta_{1,MLE}, \ldots, \theta_{k,MLE}).$$


Let $p$ equal the probability of success in a sequence of Bernoulli trials or the proportion of a large population with a certain characteristic. The method of moments estimate for $p$ is the relative frequency of success (having that characteristic). It will be shown below that the maximum likelihood estimate for $p$ is also the relative frequency of success.


Suppose that $X$ is $b(1, p)$ so that the p.d.f. of $X$ is

$$f(x; p) = p^x(1-p)^{1-x}, \quad x = 0, 1, \quad 0 \le p \le 1.$$

Sometimes this is written

$$p \in \Omega = \{p : 0 \le p \le 1\},$$

where $\Omega$ is used to represent the parameter space, that is, the space of all possible values of the parameter. A random sample $X_1, X_2, \ldots, X_n$ is taken, and the problem is to find an estimator $u(X_1, X_2, \ldots, X_n)$ such that $u(x_1, x_2, \ldots, x_n)$ is a good point estimate of $p$, where $x_1, x_2, \ldots, x_n$ are the observed values of the random sample. Now the probability that $X_1, X_2, \ldots, X_n$ takes the particular values is

$$P(X_1 = x_1, \ldots, X_n = x_n) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i},$$

which is the joint p.d.f. of $X_1, X_2, \ldots, X_n$ evaluated at the observed values. One reasonable way to proceed towards finding a good estimate of $p$ is to regard this probability (or joint p.d.f.) as a function of $p$ and find the value of $p$ that maximizes it. That is, find the $p$ value most likely to have produced these sample values. The joint p.d.f., when regarded as a function of $p$, is frequently called the likelihood function. Thus here the likelihood function is

$$L(p) = L(p; x_1, x_2, \ldots, x_n) = f(x_1; p)\,f(x_2; p) \cdots f(x_n; p) = p^{\sum x_i}(1-p)^{n - \sum x_i}, \quad 0 \le p \le 1.$$

To find the value of $p$ that maximizes $L(p)$, first take its derivative for $0 < p < 1$:

$$\frac{dL(p)}{dp} = \Big(\sum x_i\Big) p^{\sum x_i - 1}(1-p)^{n - \sum x_i} - \Big(n - \sum x_i\Big) p^{\sum x_i}(1-p)^{n - \sum x_i - 1}.$$

Setting this first derivative equal to zero gives

$$p^{\sum x_i}(1-p)^{n-\sum x_i}\left[\frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p}\right] = 0.$$

Since $0 < p < 1$, this equals zero when

$$\frac{\sum x_i}{p} - \frac{n - \sum x_i}{1-p} = 0,$$

or, equivalently,

$$p = \frac{\sum x_i}{n} = \bar{x}.$$

The corresponding statistic, namely $\sum X_i/n = \bar{X}$, is called the maximum likelihood estimator and is denoted by $\hat{p}$, that is,

$$\hat{p} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}.$$

When finding a maximum likelihood estimator, it is often easier to find the value of the parameter that maximizes the natural logarithm of the likelihood function rather than the value of the parameter that maximizes the likelihood function itself. Because the natural logarithm function is an increasing function, the solution will be the same. To see this, the example considered above gives, for $0 < p < 1$,

$$\ln L(p) = \Big(\sum_{i=1}^n x_i\Big)\ln p + \Big(n - \sum_{i=1}^n x_i\Big)\ln(1-p).$$

To find the maximum, set the first derivative equal to zero to obtain

$$\frac{d[\ln L(p)]}{dp} = \Big(\sum_{i=1}^n x_i\Big)\frac{1}{p} + \Big(n - \sum_{i=1}^n x_i\Big)\frac{-1}{1-p} = 0,$$

which is the same as the previous equation. Thus the solution is $p = \bar{x}$ and the maximum likelihood estimator for $p$ is $\hat{p} = \bar{X}$.
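
A small numerical check can make this concrete. The sketch below evaluates the Bernoulli log-likelihood on a grid and compares the grid maximizer with the closed form $\bar{x}$; the sample is hypothetical and the grid search merely stands in for the calculus argument.

```python
import math

def bernoulli_log_likelihood(p, xs):
    """ln L(p) = (sum x_i) ln p + (n - sum x_i) ln(1 - p), for 0 < p < 1."""
    k, n = sum(xs), len(xs)
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

# HYPOTHETICAL sample of 10 Bernoulli outcomes (6 successes).
xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Evaluate ln L(p) on a grid over (0, 1) and pick the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))
p_closed_form = sum(xs) / len(xs)
print(p_grid, p_closed_form)
```

The grid maximizer coincides with the relative frequency of success, as the derivation predicts.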


Motivated by the preceding illustration, the formal definition of maximum likelihood estimators is presented. This definition is used in both the discrete and continuous cases. In many practical cases, these estimators (and estimates) are unique. For many applications there is just one unknown parameter. In this case the likelihood function is given by

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta).$$



MAXIMUM LIKELIHOOD ESTIMATION - EXAMPLES 


EXPONENTIAL DISTRIBUTION 


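Let $X_1, X_2, \ldots, X_n$ be a random sample from the exponential distribution with p.d.f.

$$f(x;\theta) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 < x < \infty, \quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\}.$$

The likelihood function is given by

$$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = \left(\frac{1}{\theta}e^{-x_1/\theta}\right)\left(\frac{1}{\theta}e^{-x_2/\theta}\right)\cdots\left(\frac{1}{\theta}e^{-x_n/\theta}\right) = \frac{1}{\theta^n}\exp\left(\frac{-\sum_{i=1}^n x_i}{\theta}\right), \quad 0 < \theta < \infty.$$

The natural logarithm of $L(\theta)$ is

$$\ln L(\theta) = -n\ln\theta - \frac{1}{\theta}\sum_{i=1}^n x_i, \quad 0 < \theta < \infty.$$

Thus,

$$\frac{d[\ln L(\theta)]}{d\theta} = \frac{-n}{\theta} + \frac{\sum_{i=1}^n x_i}{\theta^2} = 0.$$

The solution of this equation for $\theta$ is

$$\theta = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}.$$

Note that

$$\frac{d[\ln L(\theta)]}{d\theta} = \frac{1}{\theta}\left(-n + \frac{n\bar{x}}{\theta}\right) > 0 \quad\text{for } \theta < \bar{x},$$

$$\frac{d[\ln L(\theta)]}{d\theta} = \frac{1}{\theta}\left(-n + \frac{n\bar{x}}{\theta}\right) = 0 \quad\text{for } \theta = \bar{x},$$

$$\frac{d[\ln L(\theta)]}{d\theta} = \frac{1}{\theta}\left(-n + \frac{n\bar{x}}{\theta}\right) < 0 \quad\text{for } \theta > \bar{x}.$$

Hence $\ln L(\theta)$ has a maximum at $\bar{x}$, and the maximum likelihood estimator for $\theta$ is $\hat\theta = \bar{X}$. This is both an unbiased estimator and the method of moments estimator for $\theta$.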


GEOMETRIC DISTRIBUTION 
Let $X_1, X_2, \ldots, X_n$ be a random sample from the geometric distribution with p.d.f.

$$f(x;p) = (1-p)^{x-1}p, \quad x = 1, 2, 3, \ldots$$

The likelihood function is given by

$$L(p) = (1-p)^{x_1-1}p\,(1-p)^{x_2-1}p\cdots(1-p)^{x_n-1}p = p^n(1-p)^{\sum x_i - n}, \quad 0 \le p \le 1.$$

The natural logarithm of $L(p)$ is

$$\ln L(p) = n\ln p + \left(\sum_{i=1}^n x_i - n\right)\ln(1-p), \quad 0 < p < 1.$$

Thus, restricting $p$ to $0 < p < 1$ so as to be able to take the derivative, we have

$$\frac{d\ln L(p)}{dp} = \frac{n}{p} - \frac{\sum_{i=1}^n x_i - n}{1-p} = 0.$$

Solving for $p$, we obtain

$$p = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}}.$$

So the maximum likelihood estimator of $p$ is

$$\hat{p} = \frac{n}{\sum_{i=1}^n X_i} = \frac{1}{\bar{X}}.$$

Again this estimator is the method of moments estimator, and it agrees with intuition because, in $n$ observations of a geometric random variable, there are $n$ successes in the $\sum_{i=1}^n x_i$ trials. Thus the estimate of $p$ is the number of successes divided by the total number of trials.


NORMAL DISTRIBUTION 
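Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\theta_1, \theta_2)$, where

$$\Omega = \{(\theta_1, \theta_2) : -\infty < \theta_1 < \infty,\ 0 < \theta_2 < \infty\}.$$

That is, here let $\theta_1 = \mu$ and $\theta_2 = \sigma^2$. Then

$$L(\theta_1, \theta_2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\theta_2}}\exp\left[-\frac{(x_i - \theta_1)^2}{2\theta_2}\right],$$

or equivalently,

$$L(\theta_1, \theta_2) = \left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^n \exp\left[\frac{-\sum_{i=1}^n (x_i - \theta_1)^2}{2\theta_2}\right].$$

The natural logarithm of the likelihood function is

$$\ln L(\theta_1, \theta_2) = -\frac{n}{2}\ln(2\pi\theta_2) - \frac{\sum_{i=1}^n (x_i - \theta_1)^2}{2\theta_2}.$$

The partial derivatives with respect to $\theta_1$ and $\theta_2$ are

$$\frac{\partial(\ln L)}{\partial\theta_1} = \frac{1}{\theta_2}\sum_{i=1}^n (x_i - \theta_1)$$

and

$$\frac{\partial(\ln L)}{\partial\theta_2} = \frac{-n}{2\theta_2} + \frac{1}{2\theta_2^2}\sum_{i=1}^n (x_i - \theta_1)^2.$$

The equation $\partial(\ln L)/\partial\theta_1 = 0$ has the solution $\theta_1 = \bar{x}$. Setting $\partial(\ln L)/\partial\theta_2 = 0$ and replacing $\theta_1$ by $\bar{x}$ yields

$$\theta_2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$

By considering the usual condition on the second partial derivatives, these solutions do provide a maximum. Thus the maximum likelihood estimators of $\mu = \theta_1$ and $\sigma^2 = \theta_2$ are

$$\hat\theta_1 = \bar{X} \quad\text{and}\quad \hat\theta_2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2.$$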


When we compare the above example with the introductory one, we see that the method of moments estimators and the maximum likelihood estimators for $\mu$ and $\sigma^2$ are the same. But this is not always the case. If they are not the same, which is better? The maximum likelihood estimator of $\theta$ has an approximate normal distribution with mean $\theta$ and a variance that is equal to a certain lower bound; thus, at least approximately, it is the unbiased minimum variance estimator. Accordingly, most statisticians prefer maximum likelihood estimators to estimators found using the method of moments.
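
The closed-form estimators derived for the exponential, geometric, and normal cases can be sketched as small functions. The function names and all three samples are hypothetical and serve only as an illustration.

```python
from statistics import fmean

def mle_exponential(xs):
    """theta-hat = sample mean, for f(x; theta) = (1/theta) exp(-x/theta)."""
    return fmean(xs)

def mle_geometric(xs):
    """p-hat = n / sum(x_i) = 1 / sample mean, for x = 1, 2, 3, ..."""
    return len(xs) / sum(xs)

def mle_normal(xs):
    """(mu-hat, sigma^2-hat) = (mean, (1/n) * sum (x_i - mean)^2)."""
    m = fmean(xs)
    return m, sum((x - m) ** 2 for x in xs) / len(xs)

# HYPOTHETICAL samples, for illustration only.
waiting_times = [2.0, 0.5, 3.5, 1.0, 3.0]
trials_to_success = [1, 4, 2, 3, 2]
measurements = [4.0, 6.0, 5.0, 7.0, 3.0]

print(mle_exponential(waiting_times))    # sample mean
print(mle_geometric(trials_to_success))  # 5 successes in 12 trials
print(mle_normal(measurements))
```

Note that the normal variance estimator divides by $n$, not $n-1$, so it differs from the sample variance $S^2$.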


BINOMIAL DISTRIBUTION 


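Observations: $k$ successes in $n$ Bernoulli trials, that is, $k = \sum_{i=1}^n x_i$ with each $x_i$ equal to 0 or 1. Multiplying the likelihood equation $\sum x_i/p - (n - \sum x_i)/(1-p) = 0$ by $p(1-p)$ gives

$$\sum_{i=1}^n x_i - p\sum_{i=1}^n x_i - np + p\sum_{i=1}^n x_i = 0,$$

so that

$$\hat{p} = \frac{\sum_{i=1}^n x_i}{n} = \frac{k}{n}.$$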
POISSON DISTRIBUTION 
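Observations: $x_1, x_2, \ldots, x_n$;

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$

$$\ln L(\lambda) = -\lambda n + \sum_{i=1}^n x_i \ln\lambda - \ln\Big(\prod_{i=1}^n x_i!\Big)$$

$$\frac{d[\ln L(\lambda)]}{d\lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^n x_i = 0, \qquad \hat\lambda = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}.$$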


ASYMPTOTIC DISTRIBUTION OF MAXIMUM LIKELIHOOD ESTIMATORS 




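Let us consider a distribution with p.d.f. $f(x;\theta)$ such that the parameter $\theta$ is not involved in the support of the distribution. We want to be able to find the maximum likelihood estimator $\hat\theta$ by solving

$$\frac{\partial[\ln L(\theta)]}{\partial\theta} = 0,$$

where the partial derivative is used because $L(\theta)$ also involves $x_1, x_2, \ldots, x_n$.

That is,

$$\frac{\partial[\ln L(\hat\theta)]}{\partial\theta} = 0,$$

where now, with $\hat\theta$ in this expression,

$$L(\hat\theta) = f(X_1;\hat\theta)\,f(X_2;\hat\theta)\cdots f(X_n;\hat\theta).$$

We can approximate the left-hand member of this latter equation by a linear function found from the first two terms of a Taylor's series expanded about $\theta$, namely

$$\frac{\partial[\ln L(\theta)]}{\partial\theta} + (\hat\theta - \theta)\frac{\partial^2[\ln L(\theta)]}{\partial\theta^2} \approx 0,$$

when $L(\theta) = f(X_1;\theta)\,f(X_2;\theta)\cdots f(X_n;\theta)$.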


Obviously, this approximation is good enough only if $\hat\theta$ is close to $\theta$, and an adequate mathematical proof involves those conditions. But a heuristic argument can be made by solving for $\hat\theta - \theta$ to obtain

$$\hat\theta - \theta = \frac{\partial[\ln L(\theta)]/\partial\theta}{-\,\partial^2[\ln L(\theta)]/\partial\theta^2}. \tag{1}$$
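Recall that

$$\ln L(\theta) = \ln f(X_1;\theta) + \ln f(X_2;\theta) + \cdots + \ln f(X_n;\theta)$$

and

$$\frac{\partial \ln L(\theta)}{\partial\theta} = \sum_{i=1}^{n} \frac{\partial[\ln f(X_i;\theta)]}{\partial\theta}. \tag{2}$$

The expression (2) is the sum of the $n$ independent and identically distributed random variables

$$Y_i = \frac{\partial[\ln f(X_i;\theta)]}{\partial\theta}, \quad i = 1, 2, \ldots, n,$$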


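and thus, by the Central Limit Theorem, it has an approximate normal distribution with mean (in the continuous case) equal to

$$\int_{-\infty}^{\infty}\frac{\partial[\ln f(x;\theta)]}{\partial\theta}f(x;\theta)\,dx = \int_{-\infty}^{\infty}\frac{\partial[f(x;\theta)]/\partial\theta}{f(x;\theta)}\,f(x;\theta)\,dx = \int_{-\infty}^{\infty}\frac{\partial[f(x;\theta)]}{\partial\theta}\,dx = \frac{\partial}{\partial\theta}\left[\int_{-\infty}^{\infty} f(x;\theta)\,dx\right] = \frac{\partial}{\partial\theta}[1] = 0. \tag{3}$$

Clearly, a mathematical condition is needed that makes it permissible to interchange the operations of integration and differentiation in those last steps. Of course, the integral of $f(x;\theta)$ is equal to one because it is a p.d.f.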


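Since we know that the mean of each $Y$ is

$$\int_{-\infty}^{\infty}\frac{\partial[\ln f(x;\theta)]}{\partial\theta}f(x;\theta)\,dx = 0,$$

let us take derivatives of each member of this equation with respect to $\theta$, obtaining

$$\int_{-\infty}^{\infty}\left\{\frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2}f(x;\theta) + \frac{\partial[\ln f(x;\theta)]}{\partial\theta}\,\frac{\partial[f(x;\theta)]}{\partial\theta}\right\}dx = 0.$$

However,

$$\frac{\partial[f(x;\theta)]}{\partial\theta} = \frac{\partial[\ln f(x;\theta)]}{\partial\theta}\,f(x;\theta),$$

so

$$\int_{-\infty}^{\infty}\left\{\frac{\partial[\ln f(x;\theta)]}{\partial\theta}\right\}^2 f(x;\theta)\,dx = -\int_{-\infty}^{\infty}\frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2}\,f(x;\theta)\,dx.$$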


Since $E(Y) = 0$, this last expression provides the variance of $Y = \partial[\ln f(X;\theta)]/\partial\theta$. Then the variance of expression (2) is $n$ times this value, namely

$$-nE\left\{\frac{\partial^2[\ln f(X;\theta)]}{\partial\theta^2}\right\}.$$


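Let us rewrite (1) as

$$\frac{\sqrt{n}\,(\hat\theta - \theta)}{1\big/\sqrt{-E\{\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}} = \frac{\dfrac{\partial[\ln L(\theta)]/\partial\theta}{\sqrt{-nE\{\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}}{\dfrac{-\frac{1}{n}\,\partial^2[\ln L(\theta)]/\partial\theta^2}{E\{-\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}. \tag{4}$$

The numerator on the right-hand side of (4) has an approximate $N(0,1)$ distribution, and those unstated mathematical conditions require, in some sense, that $-\frac{1}{n}\,\partial^2[\ln L(\theta)]/\partial\theta^2$ converge to $E\{-\partial^2[\ln f(X;\theta)]/\partial\theta^2\}$. Accordingly, the ratio given in equation (4) must be approximately $N(0,1)$. That is, $\hat\theta$ has an approximate normal distribution with mean $\theta$ and standard deviation

$$\frac{1}{\sqrt{-nE\{\partial^2[\ln f(X;\theta)]/\partial\theta^2\}}}.$$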


Example: 
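With the underlying exponential p.d.f.

$$f(x;\theta) = \frac{1}{\theta}e^{-x/\theta}, \quad 0 < x < \infty, \quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\},$$

$\bar{X}$ is the maximum likelihood estimator. Since $\ln f(x;\theta) = -\ln\theta - x/\theta$ and

$$\frac{\partial[\ln f(x;\theta)]}{\partial\theta} = -\frac{1}{\theta} + \frac{x}{\theta^2} \quad\text{and}\quad \frac{\partial^2[\ln f(x;\theta)]}{\partial\theta^2} = \frac{1}{\theta^2} - \frac{2x}{\theta^3},$$

we have

$$-E\left[\frac{1}{\theta^2} - \frac{2X}{\theta^3}\right] = -\frac{1}{\theta^2} + \frac{2\theta}{\theta^3} = \frac{1}{\theta^2},$$

because $E(X) = \theta$. That is, $\bar{X}$ has an approximate normal distribution with mean $\theta$ and standard deviation $\theta/\sqrt{n}$. Thus the random interval $\bar{X} \pm 1.96\,(\theta/\sqrt{n})$ has an approximate probability of 0.95 of covering $\theta$. Substituting the observed $\bar{x}$ for $\theta$, as well as for $\bar{X}$, we say that $\bar{x} \pm 1.96\,\bar{x}/\sqrt{n}$ is an approximate 95% confidence interval for $\theta$.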
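
The word "approximate" can be examined with a small simulation. The sketch below repeatedly draws exponential samples, forms $\bar{x} \pm 1.96\,\bar{x}/\sqrt{n}$, and counts how often the interval covers the true $\theta$. The settings ($\theta = 3$, $n = 50$, 2000 replications, seed 0) are arbitrary choices for illustration, not values from the text.

```python
import math
import random

def approx_ci_exponential(xs, z=1.96):
    """Approximate CI xbar +/- z * xbar / sqrt(n) for the exponential mean."""
    n = len(xs)
    xbar = sum(xs) / n
    half_width = z * xbar / math.sqrt(n)
    return xbar - half_width, xbar + half_width

def coverage(theta, n, reps, seed=0):
    """Fraction of simulated intervals that contain the true theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # expovariate takes the rate 1/theta, so the mean is theta.
        xs = [rng.expovariate(1.0 / theta) for _ in range(n)]
        lo, hi = approx_ci_exponential(xs)
        hits += lo <= theta <= hi
    return hits / reps

# HYPOTHETICAL settings: theta = 3, n = 50, 2000 replications.
cov = coverage(theta=3.0, n=50, reps=2000)
print(f"estimated coverage: {cov:.3f}")
```

The estimated coverage should land near, though not exactly at, the nominal 0.95, since the normal approximation is only asymptotic.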


Example: 
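The maximum likelihood estimator for $\lambda$ in

$$f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots, \quad \lambda \in \Omega = \{\lambda : 0 < \lambda < \infty\}$$

is $\hat\lambda = \bar{X}$. Now $\ln f(x;\lambda) = x\ln\lambda - \lambda - \ln x!$, with $\partial[\ln f(x;\lambda)]/\partial\lambda = x/\lambda - 1$ and $\partial^2[\ln f(x;\lambda)]/\partial\lambda^2 = -x/\lambda^2$. Thus $-E(-X/\lambda^2) = \lambda/\lambda^2 = 1/\lambda$, and $\hat\lambda = \bar{X}$ has an approximate normal distribution with mean $\lambda$ and standard deviation $\sqrt{\lambda/n}$. Finally, $\bar{x} \pm 1.645\sqrt{\bar{x}/n}$ serves as an approximate 90% confidence interval for $\lambda$. With the data from example(...), $\bar{x} = 2.225$ and hence this interval is from 1.887 to 2.563.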


It is interesting that there is another theorem which is somewhat related to the preceding result, in that the variance of $\hat\theta$ serves as a lower bound for the variance of every unbiased estimator of $\theta$. Thus we know that if a certain unbiased estimator has a variance equal to that lower bound, we cannot find a better one, and hence it is the best in the sense of being the unbiased minimum variance estimator. This is called the Rao-Cramér Inequality.


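Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with p.d.f.

$$f(x;\theta), \quad \theta \in \Omega = \{\theta : c < \theta < d\},$$

where the support of $X$ does not depend upon $\theta$, so that we can differentiate, with respect to $\theta$, under integral signs like that in the following integral:

$$\int_{-\infty}^{\infty} f(x;\theta)\,dx = 1.$$

If $Y = u(X_1, X_2, \ldots, X_n)$ is an unbiased estimator of $\theta$, then

$$Var(Y) \ge \frac{1}{n\int_{-\infty}^{\infty}\{[\partial\ln f(x;\theta)/\partial\theta]\}^2 f(x;\theta)\,dx} = \frac{-1}{n\int_{-\infty}^{\infty}[\partial^2\ln f(x;\theta)/\partial\theta^2]\,f(x;\theta)\,dx}.$$

Note that the two integrals in the respective denominators are the expectations

$$E\left\{\left[\frac{\partial\ln f(X;\theta)}{\partial\theta}\right]^2\right\} \quad\text{and}\quad E\left[\frac{\partial^2\ln f(X;\theta)}{\partial\theta^2}\right];$$

sometimes one is easier to compute than the other.

Note that the lower bounds for two distributions, the exponential and the Poisson, were computed above. Those respective lower bounds were $\theta^2/n$ and $\lambda/n$. Since in each case the variance of $\bar{X}$ equals the lower bound, $\bar{X}$ is the unbiased minimum variance estimator.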


Example: 
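The sample arises from a distribution with p.d.f.

$$f(x;\theta) = \theta x^{\theta-1}, \quad 0 < x < 1, \quad \theta \in \Omega = \{\theta : 0 < \theta < \infty\}.$$

We have

$$\ln f(x;\theta) = \ln\theta + (\theta-1)\ln x, \qquad \frac{\partial\ln f(x;\theta)}{\partial\theta} = \frac{1}{\theta} + \ln x,$$

and

$$\frac{\partial^2\ln f(x;\theta)}{\partial\theta^2} = -\frac{1}{\theta^2}.$$

Since $E(-1/\theta^2) = -1/\theta^2$, the lower bound of the variance of every unbiased estimator of $\theta$ is $\theta^2/n$. Moreover, the maximum likelihood estimator

$$\hat\theta = \frac{-n}{\ln\prod_{i=1}^n X_i}$$

has an approximate normal distribution with mean $\theta$ and variance $\theta^2/n$. Thus, in a limiting sense, $\hat\theta$ is the unbiased minimum variance estimator of $\theta$.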


To measure the value of estimators, their variances are compared to the Rao-Cramér lower bound. The ratio of the Rao-Cramér lower bound to the actual variance of any unbiased estimator is called the efficiency of that estimator. An estimator with efficiency of 50% requires $1/0.5 = 2$ times as many sample observations to do as well in estimation as can be done with the unbiased minimum variance estimator (the 100% efficient estimator).


TEST ABOUT PROPORTIONS 




Tests of statistical hypotheses are a very important topic; let us introduce it through an illustration.

Suppose a manufacturer of a certain printed circuit observes that about $p = 0.05$ of the circuits fail. An engineer and a statistician working together suggest some changes that might improve the design of the product. To test this new procedure, it was agreed that $n = 200$ circuits would be produced using the proposed method and then checked. Let $Y$ equal the number of these 200 circuits that fail. Clearly, if the number of failures, $Y$, is such that $Y/200$ is about 0.05, then it seems that the new procedure has not resulted in an improvement. If, instead, $Y$ is small so that $Y/200$ is about 0.01 or 0.02, we might believe that the new method is better than the old one. On the other hand, if $Y/200$ is 0.08 or 0.09, the proposed method has perhaps caused a greater proportion of failures. What is needed is a formal rule that tells when to accept the new procedure as an improvement. For example, we could accept the new procedure as an improvement if $Y \le 5$, or $Y/n \le 0.025$. We do note, however, that the probability of failure could still be about $p = 0.05$ even with the new procedure, and yet we could observe 5 or fewer failures in $n = 200$ trials.


That is, we would accept the new method as being an improvement when, in fact, it was 
not. This decision is a mistake which we call a Type I error. On the other hand, the new 
procedure might actually improve the product so that p is much smaller, say p=0.02, and yet 
we could observe y=7 failures so that y/200=0.035. Thus we would not accept the new 
method as resulting in an improvement when in fact it had. This decision would also be a 
mistake which we call a Type II error. 


If we believe these trials, using the new procedure, are independent and have about the same probability of failure on each trial, then $Y$ is binomial $b(200, p)$. We wish to make a statistical inference about $p$ using the unbiased estimator $\hat{p} = Y/200$. We could also construct a confidence interval, say one that has 95% confidence, obtaining

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{200}}.$$

This inference is very appropriate and many statisticians simply do this. If the limits of this confidence interval contain 0.05, they would not say the new procedure is necessarily better, at least until more data are taken. If, on the other hand, the upper limit of this confidence interval is less than 0.05, then they feel 95% confident that the true $p$ is now less than 0.05.


Here, in this illustration, we are testing whether or not the probability of failure has decreased from 0.05 when the new manufacturing procedure is used.

The no-change hypothesis, $H_0: p = 0.05$, is called the null hypothesis. Since $H_0: p = 0.05$ completely specifies the distribution, it is called a simple hypothesis; thus $H_0: p = 0.05$ is a simple null hypothesis.

The research worker's hypothesis $H_1: p < 0.05$ is called the alternative hypothesis. Since $H_1: p < 0.05$ does not completely specify the distribution, it is a composite hypothesis because it is composed of many simple hypotheses.

The rule of rejecting $H_0$ and accepting $H_1$ if $Y \le 5$, and otherwise accepting $H_0$, is called a test of a statistical hypothesis.
It is clearly seen that two types of errors can occur:

• Type I error: Rejecting $H_0$ and accepting $H_1$ when $H_0$ is true;
• Type II error: Accepting $H_0$ when $H_1$ is true, that is, when $H_0$ is false.


Since, in the example above, we make a Type I error if $Y \le 5$ when in fact $p = 0.05$, we can calculate the probability of this error, which we denote by $\alpha$ and call the significance level of the test. Under our assumptions, it is

$$\alpha = P(Y \le 5;\ p = 0.05) = \sum_{y=0}^{5}\binom{200}{y}(0.05)^y(0.95)^{200-y}.$$

Since $n$ is rather large and $p$ is small, these binomial probabilities can be approximated well by Poisson probabilities with $\lambda = 200(0.05) = 10$. That is, from the Poisson table, the probability of the Type I error is

$$\alpha \approx \sum_{y=0}^{5}\frac{10^y e^{-10}}{y!} = 0.067.$$


Thus, the approximate significance level of this test is $\alpha = 0.067$. This value is reasonably small. However, what about the probability of a Type II error in case $p$ has been improved to 0.02, say? This error occurs if $Y > 5$ when, in fact, $p = 0.02$; hence its probability, denoted by $\beta$, is

$$\beta = P(Y > 5;\ p = 0.02) = \sum_{y=6}^{200}\binom{200}{y}(0.02)^y(0.98)^{200-y}.$$

Again we use the Poisson approximation, here $\lambda = 200(0.02) = 4$, to obtain

$$\beta \approx 1 - \sum_{y=0}^{5}\frac{4^y e^{-4}}{y!} = 1 - 0.785 = 0.215.$$
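
Both error probabilities can be computed exactly from the binomial distribution and compared with the Poisson approximation. The following is a minimal sketch; the helper names are illustrative, and the rejection rule $Y \le 5$ with $n = 200$ is the one from the circuit example.

```python
import math

def binomial_cdf(k, n, p):
    """P(Y <= k) for Y ~ b(n, p)."""
    return sum(math.comb(n, y) * p**y * (1 - p)**(n - y) for y in range(k + 1))

def poisson_cdf(k, lam):
    """P(Y <= k) for Y ~ Poisson(lam)."""
    return sum(lam**y * math.exp(-lam) / math.factorial(y) for y in range(k + 1))

n, cutoff = 200, 5          # reject H0 when Y <= 5

alpha_exact = binomial_cdf(cutoff, n, 0.05)
alpha_poisson = poisson_cdf(cutoff, n * 0.05)     # lambda = 10

beta_exact = 1 - binomial_cdf(cutoff, n, 0.02)
beta_poisson = 1 - poisson_cdf(cutoff, n * 0.02)  # lambda = 4

print(f"alpha: exact {alpha_exact:.4f}, Poisson {alpha_poisson:.4f}")
print(f"beta:  exact {beta_exact:.4f}, Poisson {beta_poisson:.4f}")
```

The Poisson values match the table figures 0.067 and 0.215, and the exact binomial values differ from them only in the third decimal place.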


The engineers and the statisticians who created this new procedure are probably not too pleased with this answer. That is, they note that if their new procedure of manufacturing circuits has actually decreased the probability of failure to 0.02 from 0.05 (a big improvement), there is still a good chance, 0.215, that $H_0: p = 0.05$ is accepted and their improvement rejected. Thus, this test of $H_0: p = 0.05$ against $H_1: p = 0.02$ is unsatisfactory.

Without worrying more about the probability of the Type II error here, we present a frequently used procedure for testing $H_0: p = p_0$, where $p_0$ is some specified probability of success. This test is based upon the fact that the number of successes, $Y$, in $n$ independent Bernoulli trials is such that $Y/n$ has an approximate normal distribution, $N[p_0, p_0(1-p_0)/n]$, provided $H_0: p = p_0$ is true and $n$ is large. Suppose the alternative hypothesis is $H_1: p > p_0$; that is, it has been hypothesized by a research worker that something has been done to increase the probability of success. Consider the test of $H_0: p = p_0$ against $H_1: p > p_0$ that rejects $H_0$ and accepts $H_1$ if and only if

$$Z = \frac{Y/n - p_0}{\sqrt{p_0(1-p_0)/n}} \ge z_\alpha.$$

That is, if $Y/n$ exceeds $p_0$ by $z_\alpha$ standard deviations of $Y/n$, we reject $H_0$ and accept the hypothesis $H_1: p > p_0$. Since, under $H_0$, $Z$ is approximately $N(0,1)$, the approximate probability of this occurring when $H_0: p = p_0$ is true is $\alpha$. That is, the significance level of that test is approximately $\alpha$. If the alternative is $H_1: p < p_0$ instead of $H_1: p > p_0$, then the appropriate $\alpha$-level test is given by $Z \le -z_\alpha$. That is, if $Y/n$ is smaller than $p_0$ by $z_\alpha$ standard deviations of $Y/n$, we accept $H_1: p < p_0$.
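
The one-sided test based on $Z = (Y/n - p_0)/\sqrt{p_0(1-p_0)/n}$ can be sketched as follows. The function name and the observed count ($y = 4$ failures in 200 circuits) are hypothetical; the normal quantile is taken from the standard library.

```python
import math
from statistics import NormalDist

def proportion_z_test(y, n, p0, alpha, alternative):
    """One-sided z test of H0: p = p0.

    alternative is "greater" (H1: p > p0) or "less" (H1: p < p0).
    Returns (z, reject_h0)."""
    z = (y / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    if alternative == "greater":
        return z, z >= z_alpha
    if alternative == "less":
        return z, z <= -z_alpha
    raise ValueError("alternative must be 'greater' or 'less'")

# HYPOTHETICAL outcome: y = 4 failures among n = 200 circuits.
z, reject = proportion_z_test(y=4, n=200, p0=0.05, alpha=0.05, alternative="less")
print(f"z = {z:.3f}, reject H0: {reject}")
```

With these made-up numbers the observed proportion 0.02 lies almost two standard deviations below 0.05, so the lower-tail test rejects at the 5% level.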


In general, without changing the sample size or the type of the test of the hypothesis, a decrease in $\alpha$ causes an increase in $\beta$, and a decrease in $\beta$ causes an increase in $\alpha$. Both probabilities $\alpha$ and $\beta$ of the two types of errors can be decreased only by increasing the sample size or, in some way, constructing a better test of the hypothesis.


EXAMPLE 


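If $n = 100$ and we desire a test with significance level $\alpha = 0.05$, then $\alpha = P(\bar{X} \ge c;\ \mu = 60) = 0.05$ means, since $\bar{X}$ is $N(\mu, 100/100 = 1)$,

$$P\left(\frac{\bar{X} - 60}{1} \ge \frac{c - 60}{1};\ \mu = 60\right) = 0.05$$

and $c - 60 = 1.645$. Thus $c = 61.645$. The power function is

$$K(\mu) = P(\bar{X} \ge 61.645;\ \mu) = P\left(\frac{\bar{X} - \mu}{1} \ge \frac{61.645 - \mu}{1};\ \mu\right) = 1 - \Phi(61.645 - \mu).$$

In particular, this means that $\beta$ at $\mu = 65$ is

$$\beta = 1 - K(65) = \Phi(61.645 - 65) = \Phi(-3.355) \approx 0;$$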


so, with $n = 100$, both $\alpha$ and $\beta$ have decreased from their respective original values of 0.1587 and 0.0668 when $n = 25$. Rather than guess at the value of $n$, an ideal power function determines the sample size. Let us use a critical region of the form $\bar{x} \ge c$. Further, suppose that we want $\alpha = 0.025$ and, when $\mu = 65$, $\beta = 0.05$. Thus, since $\bar{X}$ is $N(\mu, 100/n)$,

$$0.025 = P(\bar{X} \ge c;\ \mu = 60) = 1 - \Phi\left(\frac{c - 60}{10/\sqrt{n}}\right)$$

and

$$0.05 = 1 - P(\bar{X} \ge c;\ \mu = 65) = \Phi\left(\frac{c - 65}{10/\sqrt{n}}\right).$$

That is, $\dfrac{c - 60}{10/\sqrt{n}} = 1.96$ and $\dfrac{c - 65}{10/\sqrt{n}} = -1.645$.

Solving these equations simultaneously for $c$ and $10/\sqrt{n}$, we obtain

$$c = 60 + 1.96\,\frac{5}{3.605} = 62.718; \qquad \frac{10}{\sqrt{n}} = \frac{5}{3.605}.$$

Thus, $\sqrt{n} = 7.21$ and $n = 51.98$. Since $n$ must be an integer, we would use $n = 52$ and obtain $\alpha = 0.025$ and $\beta = 0.05$, approximately.
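
The simultaneous solution for $n$ and $c$ can be sketched in code. Subtracting the two equations gives $\sigma/\sqrt{n} = (\mu_1 - \mu_0)/(z_\alpha + z_\beta)$, from which both unknowns follow. The function name is hypothetical, and the quantiles come from the standard library rather than a table, so the last digits may differ slightly from the hand calculation.

```python
import math
from statistics import NormalDist

def design_upper_tail_test(mu0, mu1, sigma, alpha, beta):
    """Solve (c - mu0)/(sigma/sqrt(n)) = z_alpha and
    (c - mu1)/(sigma/sqrt(n)) = -z_beta for n and c (mu1 > mu0)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(1 - beta)
    se = (mu1 - mu0) / (z_alpha + z_beta)   # required sigma / sqrt(n)
    n_exact = (sigma / se) ** 2
    c = mu0 + z_alpha * se
    return n_exact, c

n_exact, c = design_upper_tail_test(mu0=60, mu1=65, sigma=10, alpha=0.025, beta=0.05)
print(f"n = {n_exact:.2f} (use {math.ceil(n_exact)}), c = {c:.3f}")
```

Rounding $n$ up to the next integer keeps both error probabilities at or slightly below their targets.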


For a number of years there has been another value associated with a statistical test, and most statistical computer programs automatically print this out; it is called the probability value or, for brevity, p-value. The p-value associated with a test is the probability that we obtain the observed value of the test statistic or a value that is more extreme in the direction of the alternative hypothesis, calculated when $H_0$ is true. Rather than select the critical region ahead of time, the p-value of a test can be reported and the reader then makes a decision.


Say we are testing $H_0: \mu = 60$ against $H_1: \mu > 60$ with a sample mean $\bar{X}$ based on $n = 52$ observations. Suppose that we obtain the observed sample mean of $\bar{x} = 62.75$. If we compute the probability of obtaining an $\bar{x}$ of that value of 62.75 or greater when $\mu = 60$, then we obtain the p-value associated with $\bar{x} = 62.75$. That is,

$$p\text{-value} = P(\bar{X} \ge 62.75;\ \mu = 60) = P\left(\frac{\bar{X} - 60}{10/\sqrt{52}} \ge \frac{62.75 - 60}{10/\sqrt{52}};\ \mu = 60\right) = 1 - \Phi\left(\frac{62.75 - 60}{10/\sqrt{52}}\right) = 1 - \Phi(1.983) = 0.0237.$$

If this p-value is small, we tend to reject the hypothesis $H_0: \mu = 60$. For example, rejection of $H_0: \mu = 60$ if the p-value is less than or equal to 0.025 is exactly the same as rejection if $\bar{x} \ge 62.718$. That is, $\bar{x} = 62.718$ has a p-value of 0.025. To help keep the definition of p-value in mind, we note that it can be thought of as that tail-end probability, under $H_0$, of the distribution of the statistic, here $\bar{X}$, beyond the observed value of the statistic. See Figure 1 for the p-value associated with $\bar{x} = 62.75$.

Figure 1: The p-value associated with $\bar{x} = 62.75$ (p-value = 0.0237).
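
The p-value calculation is a one-line tail probability. A minimal sketch, with an illustrative function name and the numbers from the example ($\bar{x} = 62.75$, $\mu_0 = 60$, $\sigma = 10$, $n = 52$):

```python
import math
from statistics import NormalDist

def upper_tail_p_value(xbar, mu0, sigma, n):
    """p-value P(Xbar >= xbar; mu = mu0) when Xbar is N(mu0, sigma^2 / n)."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    return 1 - NormalDist().cdf(z)

p_value = upper_tail_p_value(xbar=62.75, mu0=60, sigma=10, n=52)
print(f"p-value = {p_value:.4f}")
```

Evaluating the same function at $\bar{x} = 62.718$ gives a value close to 0.025, which is the equivalence between the p-value rule and the critical region noted above.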


Example: 

Suppose that in the past, a golfer’s scores have been (approximately) normally distributed 
with mean p=90 and o7=9. After taking some lessons, the golfer has reason to believe that 
the mean p has decreased. (We assume that o? is still about 9.) To test the null hypothesis 
Ho: w=90 against the alternative hypothesis H;: 4 < 90, the golfer plays 16 games, 
computing the sample mean z.If x is small, say x < c, then Ho is rejected and Hy, 


accepted; that is, it seems as if the mean yz has actually decreased after the lessons. If 
c=88.5, then the power function of the test is 


X—p — 885—y 
i 4 


88.5 — pb 


RUN Pe 88 Pe 
(7) < Ub 3/4 


ee ae 


Because 9/16 is the variance of X. In particular, 
a = K (90) = &(—2) = 1 — 0.9772 = 0.0228. 


If, in fact, the true mean is equal to 4=88 after the lessons, the power is 
K (88) = (2/3) = 0.7475. If u=87, then K (87) = (2) = 0.9772. An observed 
sample mean of x = 88.25 has a 


$$\text{p-value} = P(\bar{X} \le 88.25;\ \mu = 90) = \Phi\left(\frac{88.25-90}{3/4}\right) = \Phi\left(-\frac{7}{3}\right) \approx 0.0098,$$


and this would lead to a rejection at $\alpha = 0.0228$ (or even $\alpha = 0.01$). 
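
The power function of the golfer example can be evaluated directly. This is a small sketch under the example's stated assumptions (variance 9, 16 games, critical value 88.5); the function names are illustrative only.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal c.d.f. Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def power(mu, c=88.5, sigma2=9.0, n=16):
    """K(mu) = P(X-bar <= c; mu) = Phi((c - mu) / sqrt(sigma2 / n))."""
    return normal_cdf((c - mu) / sqrt(sigma2 / n))

for mu in (90, 88, 87):
    print(mu, round(power(mu), 4))

# The p-value of the observed mean 88.25 is the same tail probability
# evaluated at mu = 90 with the cutoff moved to 88.25.
p_value = power(90, c=88.25)
print(round(p_value, 4))
```

Evaluating K at several values of the mean shows how quickly the test gains power as the true mean moves below 90.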


TESTS ABOUT ONE MEAN AND ONE VARIANCE 


In the previous paragraphs it was assumed that we were sampling from a 
normal distribution and the variance was known. The null hypothesis was 
generally of the form $H_0: \mu = \mu_0$. 


There are essentially three possibilities for the alternative hypothesis: 


1. $\mu$ has increased, $H_1: \mu > \mu_0$; 
2. $\mu$ has decreased, $H_1: \mu < \mu_0$; 
3. $\mu$ has changed, but it is not known if it has increased or 
decreased, which leads to the two-sided alternative hypothesis $H_1: \mu \ne \mu_0$. 


To test $H_0: \mu = \mu_0$ against one of these three alternative hypotheses, a 
random sample is taken from the distribution, and an observed sample 
mean, $\bar{x}$, that is close to $\mu_0$ supports $H_0$. The closeness of $\bar{x}$ to $\mu_0$ is 
measured in terms of standard deviations of $\bar{X}$, $\sigma/\sqrt{n}$, which is sometimes 
called the standard error of the mean. Thus the test statistic could be defined 
by 


$$Z = \frac{\bar{X}-\mu_0}{\sqrt{\sigma^2/n}} = \frac{\bar{X}-\mu_0}{\sigma/\sqrt{n}},$$


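and the critical regions, at a significance level $\alpha$, for the three respective 
alternative hypotheses would be: 


1. $z \ge z_\alpha$, 
2. $z \le -z_\alpha$, 
3. $|z| \ge z_{\alpha/2}$. 


In terms of $\bar{x}$ these three critical regions become 


1. $\bar{x} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}$, 
2. $\bar{x} \le \mu_0 - z_\alpha \sigma/\sqrt{n}$, 
3. $|\bar{x} - \mu_0| \ge z_{\alpha/2}\, \sigma/\sqrt{n}$. 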


These tests and critical regions are summarized in TABLE 1. The 
underlying assumption is that the distribution is $N(\mu, \sigma^2)$ and $\sigma^2$ is known. 
Thus far we have assumed that the variance $\sigma^2$ was known. We now take a 
more realistic position and assume that the variance is unknown. Suppose 
our null hypothesis is $H_0: \mu = \mu_0$ and the two-sided alternative hypothesis 
is $H_1: \mu \ne \mu_0$. If a random sample $X_1, X_2, \ldots, X_n$ is taken from a normal 
distribution $N(\mu, \sigma^2)$, recall that a confidence interval for $\mu$ was based 
on 


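$$T = \frac{\bar{X}-\mu}{\sqrt{S^2/n}} = \frac{\bar{X}-\mu}{S/\sqrt{n}}.$$


$H_0$             $H_1$               Critical Region 
$\mu = \mu_0$     $\mu > \mu_0$       $z \ge z_\alpha$ or $\bar{x} \ge \mu_0 + z_\alpha \sigma/\sqrt{n}$ 
$\mu = \mu_0$     $\mu < \mu_0$       $z \le -z_\alpha$ or $\bar{x} \le \mu_0 - z_\alpha \sigma/\sqrt{n}$ 
$\mu = \mu_0$     $\mu \ne \mu_0$     $|z| \ge z_{\alpha/2}$ or $|\bar{x} - \mu_0| \ge z_{\alpha/2}\, \sigma/\sqrt{n}$ 


TABLE 1 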


This suggests that T might be a good statistic to use for the test of $H_0: \mu = \mu_0$ 
with $\mu$ replaced by $\mu_0$. In addition, it is the natural statistic to use if we 
replace $\sigma^2/n$ by its unbiased estimator $S^2/n$ in $(\bar{X}-\mu_0)/\sqrt{\sigma^2/n}$. 
If $\mu = \mu_0$ we know that T has a t distribution with n-1 
degrees of freedom. Thus, with $\mu = \mu_0$, 


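$$P\left[\,|T| \ge t_{\alpha/2}(n-1)\right] = P\left[\frac{|\bar{X}-\mu_0|}{S/\sqrt{n}} \ge t_{\alpha/2}(n-1)\right] = \alpha.$$


Accordingly, if $\bar{x}$ and s are the sample mean and the sample standard 
deviation, the rule that rejects $H_0: \mu = \mu_0$ if and only if 


$$|t| = \frac{|\bar{x}-\mu_0|}{s/\sqrt{n}} \ge t_{\alpha/2}(n-1)$$


provides the test of the hypothesis with significance level $\alpha$. It should be 
noted that this rule is equivalent to rejecting $H_0: \mu = \mu_0$ if $\mu_0$ is not in the 
open $100(1-\alpha)\%$ confidence interval 


$$\left(\bar{x} - t_{\alpha/2}(n-1)\,s/\sqrt{n},\ \ \bar{x} + t_{\alpha/2}(n-1)\,s/\sqrt{n}\right).$$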


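Table 2 summarizes tests of hypotheses for a single mean, along with the 
three possible alternative hypotheses, when the underlying distribution is 
$N(\mu, \sigma^2)$, $\sigma^2$ is unknown, $t = (\bar{x} - \mu_0)/(s/\sqrt{n})$ and $n \le 31$. If n>31, 
use Table 1 for approximate tests with $\sigma$ replaced by s. 


$H_0$             $H_1$               Critical Region 
$\mu = \mu_0$     $\mu > \mu_0$       $t \ge t_\alpha(n-1)$ or $\bar{x} \ge \mu_0 + t_\alpha(n-1)\,s/\sqrt{n}$ 
$\mu = \mu_0$     $\mu < \mu_0$       $t \le -t_\alpha(n-1)$ or $\bar{x} \le \mu_0 - t_\alpha(n-1)\,s/\sqrt{n}$ 
$\mu = \mu_0$     $\mu \ne \mu_0$     $|t| \ge t_{\alpha/2}(n-1)$ or $|\bar{x} - \mu_0| \ge t_{\alpha/2}(n-1)\,s/\sqrt{n}$ 


TABLE 2 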


Example: 

Let X (in millimeters) equal the growth in 15 days of a tumor induced in a 
mouse. Assume that the distribution of X is $N(\mu, \sigma^2)$. We shall test the 
null hypothesis $H_0: \mu = \mu_0 = 4.0$ millimeters against the two-sided 
alternative hypothesis $H_1: \mu \ne 4.0$. If we use n=9 observations and a 
significance level of $\alpha = 0.10$, the critical region is 


$$|t| = \frac{|\bar{x} - 4.0|}{s/\sqrt{9}} \ge t_{\alpha/2}(8) = t_{0.05}(8) = 1.860.$$


If we are given that n=9, $\bar{x} = 4.3$, and s=1.2, we see that 


$$t = \frac{4.3 - 4.0}{1.2/\sqrt{9}} = \frac{0.3}{0.4} = 0.75.$$


Thus $|t| = |0.75| < 1.860$ and we accept (do not reject) $H_0: \mu = 4.0$ at 
the $\alpha = 10\%$ significance level. See Figure 1. 


Figure 1: Rejection region at the $\alpha = 10\%$ significance level ($\alpha/2 = 0.05$ in each tail), with the observed value t = 0.75 marked. 


Note: In discussing the test of a statistical hypothesis, the word accept 
might better be replaced by do not reject. That is, if, in Example 1, $\bar{x}$ is close 
enough to 4.0 so that we accept $\mu = 4.0$, we do not want that acceptance to 
imply that $\mu$ is actually equal to 4.0. We want to say that the data do not 
deviate enough from $\mu = 4.0$ for us to reject that hypothesis; that is, we do 
not reject $\mu = 4.0$ with these observed data. With this understanding, one 
sometimes uses accept and sometimes fail to reject or do not reject the 
null hypothesis. 


In the next example the use of the t-statistic with a one-sided alternative 
hypothesis will be illustrated. 


Example: 

In attempting to control the strength of the wastes discharged into a nearby 
river, a paper firm has taken a number of measures. Members of the firm 
believe that they have reduced the oxygen-consuming power of their 
wastes from a previous mean $\mu$ of 500. They plan to test $H_0: \mu = 500$ 
against $H_1: \mu < 500$, using readings taken on n=25 consecutive days. If 
these 25 values can be treated as a random sample, then the critical region, 
for a significance level of $\alpha = 0.01$, is 


$$t = \frac{\bar{x} - 500}{s/\sqrt{25}} \le -t_{0.01}(24) = -2.492.$$


The observed values of the sample mean and sample standard deviation 
were $\bar{x} = 308.8$ and s=115.15. Since 


$$t = \frac{308.8 - 500}{115.15/\sqrt{25}} = -8.30 < -2.492,$$


we clearly reject the null hypothesis and accept $H_1: \mu < 500$. It should be 
noted, however, that although an improvement has been made, there still 
might exist the question of whether the improvement is adequate. The 95% 
confidence interval $308.8 \pm 2.064\,(115.15/5)$ or [261.27, 356.33] for $\mu$ 
might help the company answer that question. 
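
The two t-test examples above (tumor growth and waste strength) can be reproduced in a few lines. This is an illustrative sketch; the function name is hypothetical and the critical values are simply copied from the text rather than computed.

```python
from math import sqrt

def t_statistic(x_bar, mu0, s, n):
    """Observed value of T = (X-bar - mu0) / (S / sqrt(n))."""
    return (x_bar - mu0) / (s / sqrt(n))

# Tumor example: two-sided test, critical value t_{0.05}(8) = 1.860 (from the text).
t_tumor = t_statistic(4.3, 4.0, 1.2, 9)
reject_tumor = abs(t_tumor) >= 1.860

# Waste example: lower-tail test, critical value -t_{0.01}(24) = -2.492 (from the text).
t_waste = t_statistic(308.8, 500.0, 115.15, 25)
reject_waste = t_waste <= -2.492

# 95% confidence interval for the waste example, using t_{0.025}(24) = 2.064.
half_width = 2.064 * 115.15 / sqrt(25)
ci = (308.8 - half_width, 308.8 + half_width)

print(round(t_tumor, 2), reject_tumor)
print(round(t_waste, 2), reject_waste)
print(tuple(round(v, 2) for v in ci))
```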


TEST OF THE EQUALITY OF TWO INDEPENDENT NORMAL DISTRIBUTIONS 


Let X and Y have independent normal distributions $N(\mu_X, \sigma_X^2)$ and $N(\mu_Y, \sigma_Y^2)$, respectively. There are times 
when we are interested in testing whether the distributions of X and Y are the same. So if the assumption of 
normality is valid, we would be interested in testing whether the two variances are equal and whether the two 
means are equal. 


Let us first consider a test of the equality of the two means. When X and Y are independent and normally 
distributed, we can test hypotheses about their means using the same t-statistic that was used previously. 
Recall that the t-statistic used for constructing the confidence interval assumed that the variances of X and Y 
are equal. That is why we shall later consider a test for the equality of two variances. 


Let us start with an example and then give a table that lists some hypotheses and critical regions. 


Example: 

A botanist is interested in comparing the growth response of dwarf pea stems to two different levels of the 
hormone indoleacetic acid (IAA). Using 16-day-old pea plants, the botanist obtains 5-millimeter sections and 
floats these sections with different hormone concentrations to observe the effect of the hormone on the 
growth of the pea stem. 

Let X and Y denote, respectively, the independent growths that can be attributed to the hormone during the 
first 26 hours after sectioning for $(0.5)(10)^{-4}$ and $(10)^{-4}$ levels of concentration of IAA. The botanist 
would like to test the null hypothesis $H_0: \mu_X - \mu_Y = 0$ against the alternative hypothesis 
$H_1: \mu_X - \mu_Y < 0$. If we can assume X and Y are independent and normally distributed with common 
variance, respective random samples of size n and m give a test based on the statistic 


$$T = \frac{\bar{X}-\bar{Y}}{\sqrt{\left\{\left[(n-1)S_X^2 + (m-1)S_Y^2\right]/(n+m-2)\right\}(1/n+1/m)}} = \frac{\bar{X}-\bar{Y}}{S_P\sqrt{1/n+1/m}},$$


where 


$$S_P = \sqrt{\frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}}.$$


T has a t distribution with r = n + m - 2 degrees of freedom when $H_0$ is true and the variances are 
(approximately) equal. The hypothesis $H_0$ will be rejected in favor of $H_1$ if the observed value of T is less 
than $-t_\alpha(n+m-2)$. 


Example: 
In Example 1, the botanist measured the growths of pea stem segments, in millimeters, for n=11 
observations of X given in Table 1: 


0.8 1.8 1.0 0.1 0.9 1.7 1.0 1.4 0.9 1.2 0.5 
Table 1 


and m=13 observations of Y given in the Table 2: 


1.0 0.8 1.6 2.6 1.3 1.1 2.4 1.8 2.5 1.4 1.9 2.0 1.2 
Table 2 


For these data, $\bar{x} = 1.03$, $s_x^2 = 0.24$, $\bar{y} = 1.66$, and $s_y^2 = 0.35$. The critical region for testing 
$H_0: \mu_X - \mu_Y = 0$ against $H_1: \mu_X - \mu_Y < 0$ is $t \le -t_{0.05}(22) = -1.717$. Since 

$$t = \frac{1.03 - 1.66}{\sqrt{\{[10(0.24) + 12(0.35)]/(11+13-2)\}(1/11 + 1/13)}} = -2.81 < -1.717,$$

$H_0$ is clearly rejected at an $\alpha = 0.05$ significance level. 


Note: an approximate p-value of this test is 0.005 because $-t_{0.005}(22) = -2.819$. Also, the sample 
variances do not differ too much; thus most statisticians would use this two sample t-test. 
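
The pooled two-sample t statistic for the pea-stem data can be recomputed from the raw observations in Tables 1 and 2. The sketch below is illustrative; `pooled_t` is a hypothetical helper name, and the critical value -1.717 is taken from the text.

```python
from math import sqrt
from statistics import mean, variance

# Observations of X (Table 1) and Y (Table 2) from the example.
x = [0.8, 1.8, 1.0, 0.1, 0.9, 1.7, 1.0, 1.4, 0.9, 1.2, 0.5]
y = [1.0, 0.8, 1.6, 2.6, 1.3, 1.1, 2.4, 1.8, 2.5, 1.4, 1.9, 2.0, 1.2]

def pooled_t(x, y):
    """Two-sample t statistic assuming a common variance."""
    n, m = len(x), len(y)
    # Pooled estimate of the common variance (statistics.variance is the
    # unbiased sample variance, dividing by n - 1).
    sp2 = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    return (mean(x) - mean(y)) / sqrt(sp2 * (1 / n + 1 / m))

t = pooled_t(x, y)
reject = t <= -1.717  # -t_{0.05}(22), as quoted in the text
print(round(t, 2), reject)
```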


BEST CRITICAL REGIONS 


In this paragraph, let us consider the properties a satisfactory test should possess. 


Consider the test of the simple null hypothesis $H_0: \theta = \theta_0$ against the simple 
alternative hypothesis $H_1: \theta = \theta_1$. 
Let C be a critical region of size $\alpha$; that is, $\alpha = P(C; \theta_0)$. Then C is a best critical 
region of size $\alpha$ if, for every other critical region D of size $\alpha = P(D; \theta_0)$, we have 
that 


$$P(C; \theta_1) \ge P(D; \theta_1).$$

That is, when $H_1: \theta = \theta_1$ is true, the probability of rejecting $H_0: \theta = \theta_0$ using the 
critical region C is at least as great as the corresponding probability using any other 
critical region D of size $\alpha$. 


Thus a best critical region of size $\alpha$ is the critical region that has the greatest power 
among all critical regions of size $\alpha$. The Neyman-Pearson 
lemma gives sufficient conditions for a best critical region of size $\alpha$. 


Neyman-Pearson Lemma 


Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a distribution with p.d.f. $f(x; \theta)$, 
where $\theta_0$ and $\theta_1$ are two possible values of $\theta$. 


Denote the joint p.d.f. of $X_1, X_2, \ldots, X_n$ by the likelihood function 
$L(\theta) = L(\theta; x_1, x_2, \ldots, x_n) = f(x_1; \theta) f(x_2; \theta) \cdots f(x_n; \theta).$ 
If there exist a positive constant k and a subset C of the sample space such that 


1. $P[(X_1, X_2, \ldots, X_n) \in C;\ \theta_0] = \alpha$, 
2. $\dfrac{L(\theta_0)}{L(\theta_1)} \le k$ for $(x_1, x_2, \ldots, x_n) \in C$, 
3. $\dfrac{L(\theta_0)}{L(\theta_1)} \ge k$ for $(x_1, x_2, \ldots, x_n) \in C'$, 


then C is a best critical region of size $\alpha$ for testing the simple null hypothesis 
$H_0: \theta = \theta_0$ against the simple alternative hypothesis $H_1: \theta = \theta_1$. 


For a realistic application of the Neyman-Pearson lemma, consider the following, in 
which the test is based on a random sample from a normal distribution. 


Example: 
Let $X_1, X_2, \ldots, X_n$ be a random sample from a normal distribution $N(\mu, 36)$. We shall 
find the best critical region for testing the simple hypothesis $H_0: \mu = 50$ against the 
simple alternative hypothesis $H_1: \mu = 55$. Using the ratio of the likelihood functions, 
namely L(50)/L(55), we shall find those points in the sample space for which this ratio 
is less than or equal to some constant k. 
That is, we shall solve the following inequality: 

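$$\frac{L(50)}{L(55)} = \frac{(72\pi)^{-n/2}\exp\left[-\frac{1}{72}\sum_{i=1}^{n}(x_i-50)^2\right]}{(72\pi)^{-n/2}\exp\left[-\frac{1}{72}\sum_{i=1}^{n}(x_i-55)^2\right]} = \exp\left[-\frac{1}{72}\left(10\sum_{i=1}^{n}x_i + n50^2 - n55^2\right)\right] \le k.$$


If we take the natural logarithm of each member of the inequality, we find that 


$$-10\sum_{i=1}^{n}x_i - n50^2 + n55^2 \le (72)\ln k.$$

Thus, 

$$\frac{1}{n}\sum_{i=1}^{n}x_i \ge -\frac{1}{10n}\left[n50^2 - n55^2 + (72)\ln k\right],$$


or equivalently, $\bar{x} \ge c$, where $c = -\frac{1}{10n}\left[n50^2 - n55^2 + (72)\ln k\right]$. 
Thus $L(50)/L(55) \le k$ is equivalent to $\bar{x} \ge c$. 
A best critical region is, according to the Neyman-Pearson lemma, 


$$C = \{(x_1, x_2, \ldots, x_n) : \bar{x} \ge c\},$$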


where c is selected so that the size of the critical region is $\alpha$. Say n=16 and c=53. Since 
$\bar{X}$ is N(50, 36/16) under $H_0$ we have 


$$\alpha = P(\bar{X} \ge 53;\ \mu = 50) = P\left(\frac{\bar{X}-50}{6/4} \ge \frac{3}{6/4};\ \mu = 50\right) = 1 - \Phi(2) = 0.0228.$$


Example 1 illustrates what is often true, namely, that the inequality 

$$\frac{L(\theta_0)}{L(\theta_1)} \le k$$

can be expressed in terms of a function $u(x_1, x_2, \ldots, x_n)$, say, 

$$u(x_1, x_2, \ldots, x_n) \le c_1$$

or 

$$u(x_1, x_2, \ldots, x_n) \ge c_2,$$


where $c_1$ or $c_2$ is selected so that the size of the critical region is $\alpha$. Thus the test can be 
based on the statistic $u(X_1, \ldots, X_n)$. Also, for illustration, if we want $\alpha$ to be a given 
value, say 0.05, we would then choose our $c_1$ or $c_2$. In Example 1, with $\alpha = 0.05$, we want 


$$0.05 = P(\bar{X} \ge c;\ \mu = 50) = P\left(\frac{\bar{X}-50}{6/4} \ge \frac{c-50}{6/4};\ \mu = 50\right) = 1 - \Phi\left(\frac{c-50}{6/4}\right).$$


Hence it must be true that (c - 50)/(3/2) = 1.645, or equivalently, 
$c = 50 + \frac{3}{2}(1.645) \approx 52.47.$ 
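
The choice of c for a given size can also be computed rather than read from a table. The following sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the function names are hypothetical and the numbers are those of the example (variance 36, n = 16).

```python
from statistics import NormalDist

def critical_value(mu0, sigma, n, alpha):
    """Smallest c with P(X-bar >= c; mu = mu0) = alpha, for known sigma."""
    standard_error = sigma / n ** 0.5
    return mu0 + NormalDist().inv_cdf(1 - alpha) * standard_error

def size(c, mu0, sigma, n):
    """Size alpha = P(X-bar >= c; mu = mu0) of the region {x-bar >= c}."""
    standard_error = sigma / n ** 0.5
    return 1 - NormalDist(mu0, standard_error).cdf(c)

c = critical_value(50, 6, 16, 0.05)
alpha_53 = size(53, 50, 6, 16)
print(round(c, 2), round(alpha_53, 4))
```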


Example: 
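Let $X_1, X_2, \ldots, X_n$ denote a random sample of size n from a Poisson distribution with 
mean $\lambda$. A best critical region for testing $H_0: \lambda = 2$ against $H_1: \lambda = 5$ is given by 


$$\frac{L(2)}{L(5)} = \frac{2^{\sum x_i}e^{-2n}}{x_1!\,x_2!\cdots x_n!}\cdot\frac{x_1!\,x_2!\cdots x_n!}{5^{\sum x_i}e^{-5n}} \le k.$$


The inequality is equivalent to $(2/5)^{\sum x_i} e^{3n} \le k$ and $\left(\sum x_i\right)\ln(2/5) + 3n \le \ln k$. 
Since ln(2/5) < 0, this is the same as 


$$\sum_{i=1}^{n} x_i \ge \frac{\ln k - 3n}{\ln(2/5)} = c.$$


If n=4 and c=13, then 


$$\alpha = P\left(\sum_{i=1}^{4} X_i \ge 13;\ \lambda = 2\right) = 1 - 0.936 = 0.064,$$


from the tables, since $\sum_{i=1}^{4} X_i$ has a Poisson distribution with mean 8 when $\lambda = 2$. 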
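
The size of the Poisson critical region can be checked without tables by summing the Poisson probabilities directly. This is a minimal sketch with a hypothetical helper name.

```python
from math import exp, factorial

def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson random variable with the given mean."""
    return sum(exp(-mean) * mean ** i / factorial(i) for i in range(k + 1))

n, lam0, c = 4, 2, 13
# The sum of n independent Poisson(lam0) variables is Poisson(n * lam0).
alpha = 1 - poisson_cdf(c - 1, n * lam0)
print(round(alpha, 3))
```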


When $H_0: \theta = \theta_0$ and $H_1: \theta = \theta_1$ are both simple hypotheses, a critical region of size $\alpha$ 
is a best critical region if the probability of rejecting $H_0$ when $H_1$ is true is a maximum 
when compared with all other critical regions of size $\alpha$. The test using the best critical 
region is called a most powerful test because it has the greatest value of the power 
function at $\theta = \theta_1$ when compared with that of other tests of significance level $\alpha$. If $H_1$ is 
a composite hypothesis, the power of a test depends on each simple alternative in $H_1$. 


A test, defined by a critical region C of size $\alpha$, is a uniformly most powerful test if 
it is a most powerful test against each simple alternative in $H_1$. The critical region C 
is called a uniformly most powerful critical region of size $\alpha$. 


Let us now consider an example in which the alternative is composite. 


Example: 
Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu, 36)$. We have seen that when testing 
$H_0: \mu = 50$ against $H_1: \mu = 55$, a best critical region C is defined by 


$$C = \{(x_1, x_2, \ldots, x_n) : \bar{x} \ge c\},$$


where c is selected so that the significance level is $\alpha$. Now consider testing $H_0: \mu = 50$ 
against the one-sided composite alternative hypothesis $H_1: \mu > 50$. For each simple 
hypothesis in $H_1$, say $\mu = \mu_1$, the quotient of the likelihood functions is 


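$$\frac{L(50)}{L(\mu_1)} = \frac{(72\pi)^{-n/2}\exp\left[-\frac{1}{72}\sum_{i=1}^{n}(x_i-50)^2\right]}{(72\pi)^{-n/2}\exp\left[-\frac{1}{72}\sum_{i=1}^{n}(x_i-\mu_1)^2\right]} = \exp\left\{-\frac{1}{72}\left[2(\mu_1-50)\sum_{i=1}^{n}x_i + n\left(50^2-\mu_1^2\right)\right]\right\}.$$


Now $L(50)/L(\mu_1) \le k$ if and only if 


$$\bar{x} \ge \frac{(-72)\ln k}{2n(\mu_1-50)} + \frac{50+\mu_1}{2} = c.$$


Thus the best critical region of size $\alpha$ for testing $H_0: \mu = 50$ against $H_1: \mu = \mu_1$, 
where $\mu_1 > 50$, is given by 


$$C = \{(x_1, x_2, \ldots, x_n) : \bar{x} \ge c\},$$


where c is selected such that 


$$P(\bar{X} \ge c;\ H_0: \mu = 50) = \alpha.$$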


Note: the same value of c can be used for each $\mu_1 > 50$, but of course k does not remain 
the same. Since the critical region C defines a test that is most powerful against each 
simple alternative $\mu_1 > 50$, this is a uniformly most powerful test, and C is a uniformly 
most powerful critical region of size $\alpha$. Again, if $\alpha = 0.05$ and n=16, then c = 52.47. 


HYPOTHESES TESTING 


Hypotheses Testing - Examples. 


Example: 


We have tossed a coin 50 times and we got k = 19 heads. Should we accept/reject the hypothesis that p = 
0.5, that is, that the coin is fair? 
Null versus Alternative Hypothesis: 


• Null hypothesis ($H_0$): p = 0.5. 
• Alternative hypothesis ($H_1$): $p \ne 0.5$. 


EXPERIMENT 


Significance level $\alpha$ = Probability of Type I error = Pr[rejecting $H_0$ | $H_0$ true] 
P[k < 18 or k > 32] < 0.05. 


If k < 18 or k > 32, then under the null hypothesis the observed event falls into the rejection region 
with probability $\alpha$ < 0.05. 


Note: We want a as small as possible. 


Figure: Test construction, showing the acceptance region between the two rejection regions. 


Cumulative distribution function. 


Note: No evidence to reject the null hypothesis. 


Example: 

We have tossed a coin 50 times and we got k = 10 heads. Should we accept/reject the hypothesis that p = 
0.5, that is, that the coin is fair? 

EXPERIMENT 


Cumulative distribution function. 


$P[k \le 10 \text{ or } k \ge 40] \approx 0.000025$. We could reject hypothesis $H_0$ at a significance level as low as 
$\alpha = 0.000025$. 


Note: p-value is the lowest attainable significance level. 
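
Both coin-tossing examples rest on exact binomial tail probabilities, which are easy to compute. The sketch below is illustrative only (the helper names are hypothetical); it evaluates the size of the rejection region k < 18 or k > 32 and the two-sided tail probability for k = 10.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def two_sided_prob(lo, hi, n=50, p=0.5):
    """P[K <= lo or K >= hi] under Binomial(n, p)."""
    return sum(binom_pmf(k, n, p) for k in range(n + 1) if k <= lo or k >= hi)

alpha = two_sided_prob(17, 33)       # rejection region k < 18 or k > 32
p_value_10 = two_sided_prob(10, 40)  # tail probability for k = 10 heads

print(round(alpha, 4))
print(p_value_10)
print(19 < 18 or 19 > 32)  # k = 19 is not in the rejection region
```

With k = 19 the observation falls in the acceptance region, while k = 10 lies far in the tail.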


Note: In STATISTICS, to prove something = reject the hypothesis that converse is true. 


Example: 


We know that on average a mouse tail is 5 cm long. We have a group of 10 mice, and give to each of them a 
dose of vitamin T every day, from birth, for a period of 6 months. 


We want to prove that this vitamin makes the mouse tail longer. We measure tail lengths of our group and we get 
the following sample: 


5.5 5.6 4.3 5.1 5.2 6.1 5.0 5.2 5.8 4.1 


Table 1 


• Hypothesis $H_0$ - sample = sample from normal distribution with $\mu = 5$ cm. 
• Alternative $H_1$ - sample = sample from normal distribution with $\mu > 5$ cm. 


CONSTRUCTION OF THE TEST 


Figure: Construction of the test, with the "cannot reject" region below the critical value $t_{0.95}$ and the "reject" region above it. 


We do not know population variance, and/or we suspect that vitamin treatment may change the variance - so 


we use t distribution. 


Example: 
$\chi^2$ test (K. Pearson, 1900) 


To test the hypothesis that given data actually come from a population with the proposed distribution. The data 
are given in Table 2. 


0.4319 0.6874 0.5301 0.8774 0.6698 1.1900 0.4360 0.2192 0.5082 
0.3564 1.2521 0.7744 0.1954 0.3075 0.6193 0.4527 0.1843 2.2617 
0.4048 2.3923 0.7029 0.9500 0.1074 3.3593 0.2112 0.0237 0.0080 
0.1897 0.6592 0.5572 1.2336 0.3527 0.9115 0.0326 0.2555 0.7095 
0.2360 1.0536 0.6569 0.0552 0.3046 1.2388 0.1402 0.3712 1.6093 
1.2595 0.3991 0.3698 0.7944 0.4425 0.6363 2.5008 2.8841 0.9300 
3.4827 0.7658 0.3049 1.9015 2.6742 0.3923 0.3974 3.3202 3.2906 
1.3283 0.4263 2.2836 0.8007 0.3678 0.2654 0.2938 1.9808 0.6311 
0.6535 0.8325 1.4987 0.3137 0.2862 0.2545 0.5899 0.4713 1.6893 
0.6375 0.2674 0.0907 1.0383 1.0939 0.1155 1.1676 0.1737 0.0769 
1.1692 1.1440 2.4005 2.0369 0.3560 1.3249 0.1358 1.3994 1.4138 
0.0046 
DATA 


Exercise: 


Problem: Are these data sampled from a population with an exponential p.d.f.? 


Solution: 
$f(x) = a e^{-ax}$. 


1. Estimate a. 
2. Use the $\chi^2$ test. 
3. Remember d.f. = K-2, where K is the number of classes (one degree of freedom is lost for the estimated parameter a). 


CONSTRUCTION OF THE TEST 


Figure: Construction of the test; the outcome is "cannot reject". 


Actual situation     Decision                       Probability 
$H_0$ true           accept                         $1 - \alpha$ 
$H_0$ true           reject = error of type I       $\alpha$ = significance level 
$H_0$ false          accept = error of type II      $\beta$ 
$H_0$ false          reject                         $1 - \beta$ = power of the test 


TABLE 1 


PSEUDO-NUMBERS 


UNIFORM PSEUDO-RANDOM VARIABLE GENERATION 


In this paragraph, our goals will be to look at, in more detail, how and 
whether particular types of pseudo-random variable generators work, and 
how, if necessary, we can implement a generator of our own choosing. 
The requirements for our uniform random variable generator are listed 
below: 


1. A uniform marginal distribution, 

2. Independence of the uniform variables, 
3. Repeatability and portability, 

4. Computational speed. 


CURRENT ALGORITHMS 


The generation of pseudo-random variates through algorithmic methods is a 
mature field in the sense that a great deal is known theoretically about 
different classes of algorithms, and in the sense that particular algorithms in 
each of those classes have been shown, upon testing, to have good 
statistical properties. In this section, we describe the main classes of 
generators, and then make specific recommendations about which 
generators should be implemented. 


Congruential Generators 


The most widely used and best understood class of pseudo-random number 
generators are those based on the linear congruential method introduced by 
Lehmer (1951). Such generators are based on the following formula: 
Equation: 


$$U_i = (aU_{i-1} + c) \bmod m,$$


where $U_i$, $i = 1, 2, \ldots$ are the output random integers; $U_0$ is the chosen 
starting value for the recursion, called the seed; and a, c, and m are 
prechosen constants. 


Note: to convert to uniform (0,1) variates, we need only divide by the 
modulus m, that is, we use the sequence $\{U_i/m\}$. 
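
The recursion above translates almost line for line into code. The following is a minimal sketch, not a recommended generator: the function names are hypothetical and the constants a = 5, c = 3, m = 16 are toy values chosen only so that the whole cycle is easy to inspect.

```python
from itertools import islice

def lcg(a, c, m, seed):
    """Yield U_1, U_2, ... from U_i = (a * U_{i-1} + c) mod m, with U_0 = seed."""
    u = seed
    while True:
        u = (a * u + c) % m
        yield u

def uniforms(a, c, m, seed):
    """Yield the scaled values U_i / m, which lie in [0, 1)."""
    for u in lcg(a, c, m, seed):
        yield u / m

# Toy constants (an assumption for illustration only).
first_cycle = list(islice(lcg(5, 3, 16, seed=0), 16))
print(first_cycle)
print(list(islice(uniforms(5, 3, 16, seed=0), 4)))
```

With these toy constants every integer from 0 to 15 appears once before the sequence repeats, which illustrates the notion of a full cycle discussed below.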


The following properties of the algorithm are worth stating explicitly: 


1. Because of the “mod m” operation (for background on modular 
operations, see Knuth (1981)), the only possible values the algorithm 
can produce are the integers $0, 1, 2, \ldots, m-1$. This follows because, by 
definition, x mod m is the remainder after x is divided by m. 

2. Because the current random integer $U_i$ depends only on the previous 
random integer $U_{i-1}$, once a previous value has been repeated, the 
entire sequence after it must be repeated. Such a repeating sequence is 
called a cycle, and its period is the cycle length. Clearly, the 
maximum period of the congruential generator is m. For given 
choices of a, c, and m, a generator may contain many short cycles (see 
Example 1 below), and the cycle you enter will depend on the seed 
you start with. Notice that a generator with many short cycles is not 
a good one, since the output sequence will be one of a number of short 
series, each of which may not be uniformly distributed or randomly 
dispersed on the line or the plane. Moreover, if the simulation is long 
enough to cause the random numbers to repeat because of the short 
cycle length, the outputs will not be independent. 

3. If we are concerned with uniform (0,1) variates, the finest partition of 
the interval (0,1) that this generator can provide is 
$[0, 1/m, 2/m, \ldots, (m-1)/m]$. This is, of course, not truly a uniform 
(0,1) distribution since, for any k in (0, m-1), we have 
$P[k/m < U < (k+1)/m] = 0$, not 1/m as required by theory for 
continuous random variables. 


4. Choices of a,c, and m, will determine not only the fineness of the 
partition of (0,1) and the cycle length, and therefore, the uniformity of 
the marginal distribution, but also the independence properties of the 
output sequence. Properly choosing a,c, and m is a science that 
incorporates both theoretical results and empirical tests. The first rule 
is to select the modulus m to be “as large as possible”, so that there is 
some hope to address point 3 above and to generate uniform variates 
with an approximately uniform marginal distribution. However, simply 
having m large is not enough; one may still find that the generator has 
many short cycles, or that the sequence is not approximately 
independent. See example 1 below. 


Example: 
Consider 
Equation: 


U; = 2U;-1 mod Oe 


where a seed of the form $2^k$ creates a loop containing only integers that are 
powers of 2, or 
Equation: 


U; = (U;_1+ 1) mod 2” 


which generates the nonrandom sequence of increasing integers. Therefore, 
the second equation gives a generator that has the maximum possible cycle 
length but is useless for simulating a random sequence. 
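
Both kinds of failure are easy to observe numerically. In the sketch below the moduli are assumptions made purely for illustration (a prime modulus $2^{31} - 1$ for the doubling generator and $2^8$ for the counting generator); they are not claimed to be the values used in the equations above.

```python
def lcg_sequence(a, c, m, seed, count):
    """Return the first `count` values of U_i = (a * U_{i-1} + c) mod m."""
    out, u = [], seed
    for _ in range(count):
        u = (a * u + c) % m
        out.append(u)
    return out

def is_power_of_two(v):
    """True when v is a positive integer with a single set bit."""
    return v > 0 and v & (v - 1) == 0

# Doubling generator with an assumed prime modulus 2**31 - 1 and seed 2**5:
# every output is a power of two and only 31 distinct values ever occur.
doubling = lcg_sequence(2, 0, 2 ** 31 - 1, seed=2 ** 5, count=62)
print(all(is_power_of_two(v) for v in doubling), len(set(doubling)))

# Counting generator with an assumed modulus 2**8: full cycle, but not random.
counting = lcg_sequence(1, 1, 2 ** 8, seed=0, count=10)
print(counting)
```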


Fortunately, once a value of m has been selected, theoretical results exist 
that give conditions for choosing values of the multiplier a and the additive 
constant c such that all the possible integers, 0 through m - 1, are 
generated before any are repeated. 


Note: this does not eliminate the second counterexample above, which 
already has the maximal cycle length, but is a useless random number 
generator. 


THEOREM I 


A linear congruential generator will have maximal cycle length m, if and 
only if: 


• c is nonzero and is relatively prime to m (i.e., c and m have no 
common prime factors). 

• (a mod q) = 1 for each prime factor q of m. 

• (a mod 4) = 1 if 4 is a factor of m. 


PROOF 


Note: Knuth (1981, p.16). 


As a mathematical note, c is called relatively prime to m if and only if c and 
m have no common divisor other than 1, which is equivalent to c and m 
having no common prime factor. 
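
The three conditions of Theorem I can be checked mechanically and compared with a brute-force count of the cycle length. This is an illustrative sketch for small moduli only; all function names are hypothetical.

```python
from math import gcd

def prime_factors(m):
    """Return the set of prime factors of m (trial division, small m only)."""
    factors, q = set(), 2
    while q * q <= m:
        while m % q == 0:
            factors.add(q)
            m //= q
        q += 1
    if m > 1:
        factors.add(m)
    return factors

def has_full_period(a, c, m):
    """Check the three conditions of Theorem I for U_i = (a U_{i-1} + c) mod m."""
    if c == 0 or gcd(c, m) != 1:
        return False
    if any(a % q != 1 for q in prime_factors(m)):
        return False
    if m % 4 == 0 and a % 4 != 1:
        return False
    return True

def cycle_length(a, c, m, seed=0):
    """Length of the cycle eventually entered when starting from `seed`."""
    seen, u, step = {}, seed, 0
    while u not in seen:
        seen[u] = step
        u = (a * u + c) % m
        step += 1
    return step - seen[u]

print(has_full_period(5, 3, 16), cycle_length(5, 3, 16))
print(has_full_period(3, 3, 16), cycle_length(3, 3, 16))
```

For a small modulus one can loop over every pair (a, c) and confirm that the conditions hold exactly when the brute-force cycle length equals m.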


A related result concerns the case of c chosen to be 0. This case does not 
conform to the conditions of Theorem I; a value $U_i$ of zero must be avoided 
because the generator will continue to produce zero after the first 
occurrence of a zero. In particular, a seed of zero is not allowable. By 
Theorem I, a generator with c = 0, which is called a multiplicative 
congruential generator, cannot have maximal cycle length m. However, 
by Theorem II, it can have cycle length m - 1. 


THEOREM II 


If c = 0 in a linear congruential generator, then $U_i = 0$ can never be 
included in a cycle, since the 0 will always repeat. However, the generator 
will cycle through all m - 1 integers in the set $\{1, 2, \ldots, m-1\}$ if and only if: 


• m is a prime integer and 
• a is a primitive element modulo m. 


PROOF 


Note: Knuth (1981, p.19). 


A formal definition of primitive elements modulo m, as well as theoretical 
results for finding them, are given in Knuth (1981). In effect, when m is a 
prime, a is a primitive element if the cycle is of length m - 1. The results 
of Theorem II are not intuitively useful, but for our purposes, it is enough to 
note that such primitive elements exist and have been computed by 
researchers. 


Note: see, e.g., Table 24.8 in Abramowitz and Stegun, 1965. 


Hence, we now must select one of two possibilities: 


• Choose a, c, and m according to Theorem I and work with a generator 
whose cycle length is known to be m. 

• Choose c = 0, take a and m according to Theorem II, use a number 
other than zero as the seed, and work with a generator whose cycle 
length is known to be m - 1. A generator satisfying these conditions 
is known as a prime-modulus multiplicative congruential generator 
and, because of the simpler computation, it usually has an advantage in 
terms of speed over the mixed congruential generator. 


Another method of speeding up a random number generator that has 
c = 0 is to choose the modulus m to be computationally convenient. For 
instance, consider $m = 2^k$. This is clearly not a prime number, but on a 
computer the modulus operation becomes a bit-shift operation in machine 
code. In such cases, Theorem III gives a guide to the maximal cycle length. 


THEOREM III 


If c = 0 and $m = 2^k$ with k > 2, then the maximal possible cycle length is 
$2^{k-2}$. This is achieved if and only if two conditions hold: 


• a is a primitive element modulo m. 
• the seed is odd. 


PROOF 


Note: Knuth (1981, p.19). 


Notice that we sacrifice some of the cycle length and, as we will see in 
Theorem IV, we also lose some randomness in the low-order bits of the 
random variates. Having used any of Theorems I, II, or III to select triples (a, 
c, m) that lead to generators with sufficiently long cycles of known length, 
we can ask which triple gives the most random (i.e., approximately 
independent) sequence. Although some theoretical results exist for 
generators as a whole, these are generally too weak to eliminate any but the 
worst generators. Marsaglia (1985) and Knuth (1981, Chap. 3.3.3) are 
good sources for material on these results. 


THEOREM IV 


If $U_i = aU_{i-1} \bmod 2^k$, and we define 
Equation: 


$$Y_i = U_i \bmod 2^j, \quad 0 < j < k,$$


then 
Equation: 


$$Y_i = aY_{i-1} \bmod 2^j.$$


In practical terms, this means that the sequence of j low-order binary bits of 
the $U_i$ sequence, namely $Y_i$, cycles with cycle length at most $2^j$. In particular, 
the sequence of the least significant bit (i.e., j=1) in $(U_1, U_2, U_3, \ldots)$ must 
behave as (0,0,0,0,...), (1,1,1,1,...), (0,1,0,1,...) or (1,0,1,0,...). 


PROOF 


Note: Knuth (1981, pp. 12-14). 
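
The behavior of the low-order bits described in Theorem IV can be seen in a short experiment. The constants below (a = 5, k = 8, j = 3, seed 1) are toy values assumed for illustration, and the function name is hypothetical.

```python
def multiplicative_lcg(a, k, seed, count):
    """Return `count` values of U_i = a * U_{i-1} mod 2**k, starting after the seed."""
    m, out, u = 2 ** k, [], seed
    for _ in range(count):
        u = (a * u) % m
        out.append(u)
    return out

a, k, j = 5, 8, 3
u = multiplicative_lcg(a, k, seed=1, count=64)

y = [v % 2 ** j for v in u]   # the j low-order bits of each U_i
lsb = [v % 2 for v in u]      # the least significant bit (j = 1)

# The low-order bits obey their own recursion Y_i = a * Y_{i-1} mod 2**j.
recursion_holds = all(y[i] == (a * y[i - 1]) % 2 ** j for i in range(1, len(y)))
print(recursion_holds, sorted(set(y)), sorted(set(lsb)))
```

With an odd seed and an odd multiplier the least significant bit never changes, which is one of the four patterns listed in the theorem.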


Such nonrandom behavior in the low-order bits of a congruential generator with 
non-prime modulus m is an undesirable property, which may be aggravated 
by techniques such as the recycling of uniform variates. It has been 
observed (Hutchinson, 1966) that prime-modulus multiplicative 
congruential generators with full cycle (i.e., when a is a primitive 
element modulo m) tend to have fairly randomly distributed low-order bits, although 
no theory exists to explain this. 


THEOREM V 


If our congruential generator produces the sequence $(U_1, U_2, \ldots)$, and we 
look at the following sequence of points in n dimensions: 
Equation: 


$$(U_1, U_2, \ldots, U_n),\ (U_2, U_3, \ldots, U_{n+1}),\ (U_3, U_4, \ldots, U_{n+2}),\ \ldots$$


then the points will all lie in fewer than $(n!\,m)^{1/n}$ parallel hyperplanes. 


PROOF 


Note: Marsaglia (1976). 


Given these known limitations of congruential generators, we are still left 
with the question of how to choose the “best” values for a, c, and m. To do 
this, researchers have followed a straightforward but time-consuming 
procedure: 


1. Take values a, c, and m that give a sufficiently long, known cycle 
length and use the generator to produce sequences of uniform variates. 

2. Subject the output sequences to batteries of statistical tests for 
independence and a uniform marginal distribution. Document the 
results. 

3. Subject the generator to theoretical tests. In particular, the spectral test 
of Coveyou and MacPherson (1967) is currently widely used and 
recognized as a very sensitive structural test for distinguishing between 
good and bad generators. Document the results. 

4. As new, more sensitive tests appear, subject the generator to those tests. 
Several such tests are discussed in Marsaglia (1985). 


Note: Other Types of Generators 


PSEUDO-RANDOM VARIABLE GENERATORS, cont. 


A Shift-Register Generator 


An alternative class of pseudo-random number generators are shift-register or 
Tausworthe generators, which have their origins in the work of Golomb 
(1967). These algorithms operate on n-bit, pseudo-random binary vectors, 
just as congruential generators operate on pseudo-random integers. To 
return a uniform variate, the binary vector must be converted to an 
integer and divided by one plus the largest possible number, that is, by $2^n$. 


Fibonacci Generators 


The final major class of generators to be considered are the lagged 
Fibonacci generators, which take their name from the famous Fibonacci 
sequence $U_i = U_{i-1} + U_{i-2}$. This recursion is reminiscent of the 
congruential generators, with the added feature that the current value 
depends on the two previous values. 


The integer generator based directly on the Fibonacci formula 
Equation: 


$$U_i = (U_{i-1} + U_{i-2}) \bmod m$$


has been investigated, but not found to be satisfactorily random. A more 
general formulation can be given by the equation: 
Equation: 


$$U_i = U_{i-r} \,\square\, U_{i-s}, \quad r \ge 1,\ s \ge 1,\ r \ne s,$$


where the symbol $\square$ represents an arbitrary mathematical operation. 
We can think of the $U_i$ as either binary vectors, integers, or real 
numbers between 0 and 1, depending on the operation involved. 
As examples: 


1. The $U_i$ are real and $\square$ represents either mod 1 addition or 
subtraction. 

2. The $U_i$ are n-bit integers and $\square$ represents either mod $2^n$ 
addition, subtraction or multiplication. 

3. The $U_i$ are binary vectors and $\square$ represents any of binary 
addition, binary subtraction, exclusive-or addition, or multiplication. 


Other generators that generalize even further on the Fibonacci idea by using 
a linear combination of previous random integers to generate the current 
random integer are discussed in Knuth (1981, Chap 3.2.2). 
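
A lagged Fibonacci generator with addition as the operation can be sketched as follows. This is an illustration under stated assumptions, not a tested generator: the function name, the lags, the modulus, and the seeding are all hypothetical choices, and the toy run uses lags 1 and 2 so that the output is simply the Fibonacci sequence mod 10.

```python
from collections import deque
from itertools import islice

def lagged_fibonacci(seed_values, r, s, m):
    """Yield U_i = (U_{i-r} + U_{i-s}) mod m, where 0 < r < s.

    `seed_values` must hold the s most recent values, oldest first.
    """
    if not (0 < r < s) or len(seed_values) != s:
        raise ValueError("need 0 < r < s and exactly s seed values")
    state = deque(seed_values, maxlen=s)  # state[-1] is U_{i-1}
    while True:
        u = (state[-r] + state[-s]) % m
        state.append(u)  # the oldest value drops out automatically
        yield u

# Toy run: lags (1, 2) and modulus 10 give the Fibonacci numbers mod 10.
toy = list(islice(lagged_fibonacci([0, 1], r=1, s=2, m=10), 8))
print(toy)
```

In practice much longer lags and a large modulus would be used, together with a carefully chosen set of seed values.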


Combinations of Generators (Shuffling) 


Intuitively, it is tempting to believe that “combining” two sequences of 
pseudo-random variables will produce one sequence with better uniformity 
and randomness properties than either of the two originals. In fact, even 
though good congruential, Tausworthe, and Fibonacci generators exist, 
combination generators may be better for a number of reasons. 
Individual generators with short cycle lengths can be combined into one with a 
very long cycle. This can be a great advantage, especially on computers 
with limited mathematical precision. These potential advantages have led to 
the development of a number of successful combination generators and 
research into many others. 


One such generator is a combination of three congruential generators, 
developed and tested by Wichmann and Hill (1982). 


Another generator, Super-Duper, developed by G. Marsaglia, combines the 
binary form of the output from the multiplicative congruential generator 
with multiplier a = 69069 and modulus $m = 2^{32}$ with the output of the 32-bit 
Tausworthe generator using a left-shift of 17 and a right shift of 15. This 
generator performs well, though not perfectly, and suffers from some 
practical drawbacks. 


A third general variation, a shuffled generator, randomizes the order in 
which a generator’s variates are output. Specifically, we consider one 
pseudo-random variate generator that produces the sequence $(U_1, U_2, \ldots)$ of 
uniform (0,1) variates, and a second generator that outputs random integers 
$(V_1, V_2, \ldots)$, say between 1 and 16. 

The algorithm for the combined, shuffled generator is as follows: 


1. Set up a “table” in memory of locations 1 through 16 and store the 
values $U_1, U_2, \ldots, U_{16}$ sequentially in the table. 

2. Generate one value, V, between 1 and 16 from the second generator. 

3. Return the U variate from location V in the table as the desired output 
pseudo-random variate. 

4. Generate a new U variate and store it in the location V that was just 
accessed. 

5. If more random variates are desired, return to Step 2. 


Note: the size of the table can be any value, with larger tables creating 
more randomness but requiring more memory allocation. 
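
A minimal sketch of the five table-shuffling steps is given below. It is not the original authors' code: both underlying generators are stood in for by Python's `random.Random` (an assumption made only to keep the example self-contained), the table is indexed from 0 rather than 1, and the function name `shuffled_generator` and its parameters are hypothetical.

```python
import random


def shuffled_generator(table_size=16, seed_u=12345, seed_v=67890):
    """Yield uniform (0,1) variates whose output order is shuffled via a table."""
    gen_u = random.Random(seed_u)  # stand-in for the generator of U variates
    gen_v = random.Random(seed_v)  # stand-in for the generator of table indices

    # Step 1: fill the table sequentially with the first U variates.
    table = [gen_u.random() for _ in range(table_size)]

    while True:
        # Step 2: draw a table location V (0-based here).
        v = gen_v.randrange(table_size)
        # Step 3: the variate stored at location V is the output.
        out = table[v]
        # Step 4: refill location V with a fresh U variate.
        table[v] = gen_u.random()
        # Step 5: hand back the output; the loop continues on the next request.
        yield out


if __name__ == "__main__":
    stream = shuffled_generator()
    print([round(next(stream), 4) for _ in range(5)])
```

Writing the procedure as a Python generator keeps the table state between calls, so each `next(stream)` corresponds to one pass through Steps 2 to 4.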


This method of shuffling by randomly accessing and filling a table is due to 
MacLaren and Marsaglia (1965). Another scheme, attributed to 
M. Gentleman in Andrews et al. (1972), is to permute the table of 128 
random numbers before returning them for use. The use of this type of 
combination of generators has also been described in the context of 
simulation problems in physics by Binder and Stauffer (1984). 


THE INVERSE PROBABILITY METHOD FOR GENERATING RANDOM 
VARIABLES 




Once the generation of the uniform random variable is established, it can be 
used to generate other types of random variables. 


The Continuous Case 
THEOREM I 


Let X have a continuous distribution $F_X(x)$, so that $F_X^{-1}(\alpha)$ exists for 
$0 < \alpha < 1$ (and is hopefully computable). Then the random variable $F_X^{-1}(U)$ 
has distribution $F_X(x)$, where U is uniformly distributed on (0,1). 


PROOF 
Equation: 

$$P\{F_X^{-1}(U) \le x\} = P\{F_X(F_X^{-1}(U)) \le F_X(x)\}$$

because $F_X(x)$ is monotone. Thus, 
Equation: 

$$P\{F_X^{-1}(U) \le x\} = P\{U \le F_X(x)\} = F_X(x).$$

The last step follows because U is uniformly distributed on (0,1). 
Diagrammatically, we have that $\{X \le x\}$ if and only if $\{U \le F_X(x)\}$, an 
event of probability $F_X(x)$. 


As long as we can invert the distribution function $F_X(x)$ to get the inverse 
distribution function $F_X^{-1}(\alpha)$, the theorem assures us we can start with a 
pseudo-random uniform variable U and turn it into a random variable $F_X^{-1}(U)$, 
which has the required distribution $F_X(x)$. 


Example: 

The Exponential Distribution 

Consider the exponential distribution defined as 
Equation: 

$$\alpha = F_X(x) = 1 - e^{-\lambda x}, \qquad x \ge 0,\ \lambda > 0.$$

Then for the inverse distribution function we have 
Equation: 

$$x = -\frac{1}{\lambda}\ln(1-\alpha) = F_X^{-1}(\alpha).$$

Thus if U is uniformly distributed on 0 to 1, then $X = -\frac{1}{\lambda}\ln(1-U)$ has the 
distribution of an exponential random variable with parameter $\lambda$. We say, for 
convenience, that X is exponential ($\lambda$). 


Note: If U is uniform (0,1), then so is (1-U), and the pair U and (1-U) are 
interchangeable in terms of distribution. Hence, $X' = -\frac{1}{\lambda}\ln(U)$ is 
exponential. However, the two variables X and X' are correlated and are 
known as an antithetic pair. 
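
The inverse transform for the exponential case, $X = -\ln(1-U)/\lambda$, and its antithetic partner $X' = -\ln(U)/\lambda$ can be sketched in a few lines. The function names are illustrative only, and $\lambda$ is assumed to be the rate parameter (so the mean is $1/\lambda$).

```python
import math
import random


def exponential_inverse(u, lam):
    """Map a uniform value 0 <= u < 1 to an exponential(lam) variate."""
    # log1p(-u) evaluates ln(1 - u) accurately when u is close to 0.
    return -math.log1p(-u) / lam


def antithetic_exponential_pair(u, lam):
    """Return (X, X') built from u and 1 - u; requires 0 < u < 1."""
    return exponential_inverse(u, lam), -math.log(u) / lam


if __name__ == "__main__":
    rng = random.Random(2024)
    lam = 2.0
    sample = [exponential_inverse(rng.random(), lam) for _ in range(100_000)]
    # The sample mean should be close to 1 / lam.
    print(sum(sample) / len(sample))
```

Both members of the pair are exponential, but they come from the same uniform draw, so a large X goes with a small X' and the two are negatively correlated.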


Example: 

Normal and Gamma Distributions 

For both these cases there is no simple functional form for the inverse 
distribution $F_X^{-1}(\alpha)$, but because of the importance of the Normal and 
Gamma distribution models, a great deal of effort has been expended in 
deriving good approximations. 
The Normal distribution is defined through its density, 
Equation: 

$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right], \qquad -\infty < x < \infty,$$

so that, 
Equation: 

$$F_X(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(v-\mu)^2}{2\sigma^2}\right] dv.$$

The normal distribution function $F_X(x)$ is also often denoted $\Phi(x)$ when 
the parameters $\mu$ and $\sigma$ are set to 0 and 1, respectively. The distribution has no 
closed-form inverse, $F_X^{-1}(\alpha)$, but the inverse is needed so often that 
$\Phi^{-1}(\alpha)$, like logarithms or exponentials, is a system function. 

The inverse of the Gamma distribution function, which is given by 
Equation: 

$$F_X(x) = \frac{1}{\Gamma(k)} \int_0^{kx/\mu} v^{k-1} e^{-v}\,dv, \qquad x \ge 0,\ k > 0,\ \mu > 0,$$

is more difficult to compute because its shape changes radically with the 
value of k. It is, however, available on most computers as a numerically 
reliable function. 
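
As an illustration of the inverse Normal distribution function being available as a library routine, the sketch below uses `NormalDist.inv_cdf` from Python's standard `statistics` module. The wrapper name `normal_inverse` is hypothetical. The Gamma inverse is not in the standard library; it is typically obtained from a numerical package (for example `scipy.stats.gamma.ppf`), which is not shown here.

```python
import random
from statistics import NormalDist


def normal_inverse(u, mu=0.0, sigma=1.0):
    """Map a uniform value 0 < u < 1 to a Normal(mu, sigma) variate."""
    # inv_cdf is the library's numerical approximation to the inverse
    # distribution function; it raises an error for u outside (0, 1).
    return NormalDist(mu, sigma).inv_cdf(u)


if __name__ == "__main__":
    rng = random.Random(7)
    draws = []
    while len(draws) < 50_000:
        u = rng.random()
        if u > 0.0:  # random() may return exactly 0.0, which inv_cdf rejects
            draws.append(normal_inverse(u, mu=10.0, sigma=2.0))
    # The sample mean should be close to mu = 10.
    print(sum(draws) / len(draws))
```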


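Example: 

The Logistic Distribution 

A commonly used symmetric distribution, which has a shape very much like 
that of the Normal distribution, is the standardized logistic distribution. 
Equation: 

$$F_X(x) = \frac{e^{x}}{1+e^{x}} = \frac{1}{1+e^{-x}}, \qquad -\infty < x < \infty,$$

with probability density function 
Equation: 

$$f_X(x) = \frac{e^{x}}{(1+e^{x})^{2}}, \qquad -\infty < x < \infty.$$

Note: $F_X(-\infty) = \frac{e^{-\infty}}{1+e^{-\infty}} = 0$ and $F_X(\infty) = 1$ by using the 
second form for $F_X(x)$. 

Solving $\alpha = F_X(x)$ for x gives the inverse distribution function, 
Equation: 

$$x = F_X^{-1}(\alpha) = \ln\left(\frac{\alpha}{1-\alpha}\right),$$

and the random variable is generated, using the inverse probability integral 
method, as follows: $X = \ln\left(\frac{U}{1-U}\right)$. 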
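
A short sketch of the logistic case, assuming the standardized form $F_X(x) = 1/(1+e^{-x})$ so that $X = \ln(U/(1-U))$. The function names are illustrative.

```python
import math
import random


def logistic_cdf(x):
    """Standardized logistic distribution function, second form."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_inverse(u):
    """Map a uniform value 0 < u < 1 to a standardized logistic variate."""
    return math.log(u / (1.0 - u))


if __name__ == "__main__":
    rng = random.Random(11)
    draws = []
    while len(draws) < 50_000:
        u = rng.random()
        if u > 0.0:  # avoid log(0) in the rare case that u is exactly 0.0
            draws.append(logistic_inverse(u))
    # The distribution is symmetric about 0, so the mean should be near 0.
    print(sum(draws) / len(draws))
```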


The Discrete Case 


Let X have a discrete distribution $F_X(x)$; that is, $F_X(x)$ jumps at points 
$x_k$, $k = 0, 1, 2, \ldots$. Usually we have the case that $x_k = k$, so that X is 
integer-valued. 


Let the probability function be denoted by 
Equation: 

$$p_k = P(X = x_k), \qquad k = 0, 1, 2, \ldots$$

The probability distribution function is then, 
Equation: 

$$F_X(x_k) = P(X \le x_k) = \sum_{j \le k} p_j, \qquad k = 0, 1, 2, \ldots$$

and the reliability or survivor function is 
Equation: 

$$R_X(x_k) = 1 - F_X(x_k) = P(X > x_k), \qquad k = 0, 1, 2, \ldots$$


The survivor function is sometimes easier to work with than the distribution 
function, and in fields such as reliability, it is habitually used. The inverse 
probability integral transform method of generating discrete random variables 
is based on the following theorem. 


THEOREM 


Let U be uniformly distributed in the interval (0,1). Set $X = x_k$ whenever 
$F_X(x_{k-1}) < U \le F_X(x_k)$, for $k = 0, 1, 2, \ldots$ with $F_X(x_{-1}) = 0$. Then X 
has probability function $p_k$. 


PROOF 


By definition of the procedure, 

$X = x_k$ if and only if $F_X(x_{k-1}) < U \le F_X(x_k)$. 

Therefore, 

Equation: 

$$P(X = x_k) = P\{F_X(x_{k-1}) < U \le F_X(x_k)\} = F_X(x_k) - F_X(x_{k-1}) = p_k$$

by the definition of the distribution function of a uniform (0,1) random 
variable. 


Thus the inverse probability integral transform algorithm for generating X is to 
find $x_k$ such that $U \le F_X(x_k)$ and $U > F_X(x_{k-1})$, and then set $X = x_k$. 


In the discrete case, there is never any problem of numerically computing the 
inverse distribution function, but the search to find the values $F_X(x_k)$ and 
$F_X(x_{k-1})$ between which U lies can be time-consuming; generally, 
sophisticated search procedures are required. In implementing this procedure, 
we try to minimize the number of times one compares U to $F_X(x_k)$. If we 
want to generate many values of X, and $F_X(x_k)$ is not easily computable, we may 
also want to store $F_X(x_k)$ for all k rather than recompute it. Then we 
have to worry about minimizing the total memory to store values of 
$F_X(x_k)$. 
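
One way to sketch the discrete inverse transform with a stored table of distribution-function values is shown below. It precomputes the cumulative sums once and then uses a binary search (`bisect.bisect_left`) so that each draw needs roughly $\log_2 n$ comparisons of U against stored values. The binary search is one possible choice of search procedure, not one prescribed by the text, and the name `make_discrete_sampler` is hypothetical.

```python
import bisect
import itertools
import random


def make_discrete_sampler(values, probs):
    """Return a function mapping a uniform u in [0, 1) to one of `values`."""
    # Store F(x_k) for all k once, rather than recomputing it per draw.
    cdf = list(itertools.accumulate(probs))
    cdf[-1] = 1.0  # guard against rounding error in the final cumulative sum

    def sample(u):
        # bisect_left finds the smallest k with cdf[k] >= u,
        # that is, F(x_{k-1}) < u <= F(x_k).
        return values[bisect.bisect_left(cdf, u)]

    return sample


if __name__ == "__main__":
    rng = random.Random(3)
    sample = make_discrete_sampler([0, 1, 2], [0.2, 0.5, 0.3])
    draws = [sample(rng.random()) for _ in range(100_000)]
    # Relative frequencies should be near 0.2, 0.5 and 0.3.
    print([draws.count(v) / len(draws) for v in (0, 1, 2)])
```

The table costs one stored number per support point, which is the memory trade-off mentioned above.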


Example: 

The Binary Random Variable 

To generate a binary-valued random variable X that is 1 with probability p 
and 0 with probability 1-p, the algorithm is: 


• If U ≤ p, set X=1. 
• Else set X=0. 
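
The two-branch rule for a binary variable can be written directly. In this sketch the comparison uses a strict inequality because Python's `random.random()` returns values in [0, 1); for a continuous U the strict and non-strict forms have the same probability. The function name `binary_variate` is illustrative.

```python
import random


def binary_variate(u, p):
    """Return 1 with probability p and 0 otherwise, given uniform u in [0, 1)."""
    return 1 if u < p else 0


if __name__ == "__main__":
    rng = random.Random(5)
    draws = [binary_variate(rng.random(), 0.3) for _ in range(100_000)]
    # The fraction of ones should be close to p = 0.3.
    print(sum(draws) / len(draws))
```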


Example: 

The Discrete Uniform Random Variable 

Let X take on integer values between and including the integers a and b, 
where $a \le b$, with equal probabilities. Since there are $(b - a + 1)$ distinct 
values for X, the probability of getting any one of these values is, by 
definition, $1/(b - a + 1)$. If we start with a continuous uniform (0,1) random 
number U, then the discrete inverse probability integral transform shows that 
X = integer part of $[(b - a + 1)U + a]$. 


Note: The continuous random variable $(b - a + 1)U + a$ is uniformly 
distributed in the open interval $(a, b + 1)$. 
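
A sketch of the discrete uniform rule X = integer part of $[(b - a + 1)U + a]$ follows. It uses `math.floor` rather than truncation so that negative values of a are handled correctly, and it clamps the result to b as a precaution against floating-point rounding when U is extremely close to 1. The function name `discrete_uniform` is illustrative.

```python
import math
import random


def discrete_uniform(u, a, b):
    """Map a uniform u in [0, 1) to an integer between a and b inclusive."""
    x = math.floor((b - a + 1) * u + a)
    return min(x, b)  # precaution against rounding up to b + 1


if __name__ == "__main__":
    rng = random.Random(9)
    rolls = [discrete_uniform(rng.random(), 1, 6) for _ in range(60_000)]
    # Each face of the die should appear with relative frequency near 1/6.
    print([round(rolls.count(face) / len(rolls), 3) for face in range(1, 7)])
```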


Example: 

The Geometric Distribution 

Let X take values on zero and the positive integers with a geometric 
distribution. Thus, 


Equation: 

$$P(X = k) = p_k = (1-\rho)\rho^{k}, \qquad k = 0, 1, 2, \ldots,\ 0 < \rho < 1,$$

and 
Equation: 

$$P(X \le k) = F_X(k) = 1 - \rho^{k+1}, \qquad k = 0, 1, 2, \ldots$$


To generate geometrically distributed random variables then, you can proceed 
successively according to the following algorithm: 


• Compute $F_X(0) = 1 - \rho$. Generate U. 
• If $U \le F_X(0)$, set X=0 and exit. 
• Otherwise compute $F_X(1) = 1 - \rho^{2}$. 
• If $U \le F_X(1)$, set X=1 and exit. 
• Otherwise compute $F_X(2)$, and so on. 
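
The successive search for the geometric case can be sketched as below, assuming the parameterization $P(X = k) = (1-\rho)\rho^{k}$ so that $F_X(k) = 1 - \rho^{k+1}$. The symbol `rho` and the function name are assumptions made for the illustration. Each $F_X(k)$ is obtained from the previous one with a single multiplication, so nothing needs to be stored.

```python
import random


def geometric_sequential(u, rho):
    """Return the smallest k with u <= 1 - rho**(k + 1), for u in [0, 1)."""
    k = 0
    tail = rho  # tail holds rho**(k + 1), which equals 1 - F(k)
    while u > 1.0 - tail:
        # U lies above F(k): move on to k + 1 and update the tail.
        k += 1
        tail *= rho
    return k


if __name__ == "__main__":
    rng = random.Random(21)
    rho = 0.5
    draws = [geometric_sequential(rng.random(), rho) for _ in range(100_000)]
    # The sample mean should be near rho / (1 - rho), which is 1 for rho = 0.5.
    print(sum(draws) / len(draws))
```

The expected number of comparisons grows as $\rho$ approaches 1, which is the kind of search cost discussed above for the discrete case.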


